A universal compression theory:
Lottery ticket hypothesis and superpolynomial scaling laws
October 01, 2025
When training large-scale models, the performance typically scales with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of \(d\) objects can be asymptotically compressed into a function of \(\operatorname{polylog} d\) objects with vanishing error. This theorem yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. (Ia) directly establishes a proof of the dynamical lottery ticket hypothesis, which states that any ordinary network can be strongly compressed such that the learning dynamics and result remain unchanged. (Ib) shows that a neural scaling law of the form \(L\sim d^{-\alpha}\) can be boosted to an arbitrarily fast power law decay, and ultimately to \(\exp(-\alpha' \sqrt[m]{d})\).
Training contemporary AI models has become extremely costly. Modern models are very large and are often trained on enormous datasets. For example, GPT-4 is believed to have on the order of a trillion (\(10^{12}\)) parameters and to be trained on roughly a trillion tokens. Training runs can occupy clusters comparable in scale to an entire data center. In stark contrast, the brain—a comparable biological computer—appears to require far less data. A rough back-of-the-envelope estimate illustrates the gap: even if the auditory system received one word per second continuously, by age ten a child would have heard about \(10^8\) words, by which time most children have mastered their native language. This difference of roughly four orders of magnitude in data efficiency between artificial and biological systems highlights a central challenge: current AI systems may not be using data optimally.
The data efficiency of large AI models is often summarized by neural scaling laws (NSL), in which the error \(L\) decays approximately as a power of the dataset size (holding other factors fixed): \[L(N) \propto N^{-\alpha},\] and empirical values of \(\alpha\) for large language models typically lie between \(0.1\) and \(0.3\) [1]. If we assume \(\alpha=0.1\) and that current models already achieve human-like language capability, then attaining the same capability with only \(10^8\) tokens would require a modestly larger exponent, roughly \(\alpha \approx 0.15\). Hence, even a small increase in the scaling exponent could substantially reduce training cost by bringing models closer to human-level data efficiency. Yet we currently lack principled guidance on whether, and by how much, the neural scaling laws can be improved.
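The exponent estimate follows from equating the losses at the two dataset sizes (taking the \(10^{12}\)-token scale quoted above for current models): \[\left(10^{12}\right)^{-0.1} = \left(10^{8}\right)^{-\alpha'} \quad\Longrightarrow\quad \alpha' = 0.1 \times \tfrac{12}{8} = 0.15 .\]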
In this work, we prove a universal result that, under suitable definitions of model width, enables sub-exponential compression (i.e., from \(d\) objects to \(\operatorname{polylog}(d)\)) of essentially arbitrary neural networks and/or their training data, thereby opening the possibility of substantially improving a wide range of neural scaling laws. Our contributions are:
Proof of a universal compression theorem, showing that almost any symmetric function of \(d\) elements can be compressed to a function with \(O(\operatorname{polylog}(d))\) elements losslessly (Section 4);
Application to network compression, leading to a proof of what we call the dynamical lottery ticket hypothesis (LTH): a large network can be compressed such that its training dynamics matches that of the original (Section 5);
Application to compress datasets, which is a proof-of-concept showing that one can improve neural scaling laws significantly (Section 6).
Figure 1 provides a schematic of the main idea. All proofs appear in the Appendix; numerical experiments are presented alongside the related theoretical results.
We begin with two motivating examples to illustrate the ubiquity of permutation symmetry in machine learning, and then introduce the permutation-symmetric functions we study.
The simplest form of permutation symmetry exists among data points. The loss function we minimize is \(L (\theta, \{z_i\}_{i=1}^d)\), where \(d\) is the number of data points, and \(z_i=(x_i, y_i)\) is a data point, i.e., an input–label pair. The loss function is the average of a per-sample loss \(\ell\) over training data: \[L = \frac{1}{d}\sum_{i=1}^d\ell(x_i,y_i,\theta) \equiv \mathop{\mathbb{E}}_z[\ell(x,y,\theta)].\] Because the loss function \(L\) applies the same per-sample loss \(\ell\) to every data point and averages the results, permuting any pair of data points leaves the loss unchanged for any \(\theta\).
Permutation symmetry in model parameter space depends on the structure of the model submodules. Consider the model \(f(x) = W_2 \sigma( W_1 x )\), where \(W_1\) and \(W_2\) are weight matrices, and \(\sigma\) is an element-wise non-linearity. Let \(w_i^T\) be the row vectors of \(W_1\), and \(v_i\) be the column vectors of \(W_2\). Then the output reads \[\label{eq:NN_output} f(x) = \sum_{i=1}^d v_i \sigma( w_i^T x )\tag{1}\] where \(d\) is the output dimension of \(W_1\) and the input dimension of \(W_2\), commonly known as the width of the neural network. The output is symmetric under the exchange of any pair of \((v_i, w_i) \leftrightarrow (v_j, w_j)\), another example of permutation symmetry. For a deep net, different layers can have multiple decoupled permutation symmetries. More generally, other modules that have such permutation symmetry include fully connected layers, attention logits in self-attention, and attention outputs between different attention heads; it also exists when ensembling many models with the same architecture [2], [3].
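As a concrete illustration, the following minimal NumPy sketch (the width, dimensions, and ReLU nonlinearity are arbitrary choices for illustration) checks that permuting the neuron pairs \((v_i, w_i)\) leaves the output of Eq. 1 unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, d, n_out = 3, 16, 2           # input dim, hidden width, output dim
W1 = rng.normal(size=(d, n_in))     # rows are w_i^T
W2 = rng.normal(size=(n_out, d))    # columns are v_i
x = rng.normal(size=n_in)

def f(W1, W2, x):
    # f(x) = sum_i v_i * sigma(w_i^T x), with sigma = ReLU
    return W2 @ np.maximum(W1 @ x, 0.0)

perm = rng.permutation(d)
# exchanging the pairs (v_i, w_i) permutes hidden neurons but not the output
assert np.allclose(f(W1, W2, x), f(W1[perm], W2[:, perm], x))
```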
Generally, one can view the model or loss as a function of a set of permutation-symmetric inputs, while keeping other inputs implicit. We refer to each such input as an object, which may correspond to a data point or to a neuron weight \((w_i)\). In this paper, each object is embedded in \(m\) dimensions, and \(d\) is the number of such objects. We primarily consider the limit \(d \to \infty\).
Definition 1. Let each \(w_i \in V = \mathbb{R}^m\). \(f: V^d \to \mathbb{R}\) is called a symmetric function in \(\{w_i\}\) if, for any distinct \(i,j \in [d]\), \(f(\dots, w_i, \dots, w_j, \dots) \;=\; f(\dots, w_j, \dots, w_i, \dots)\).
We also use \(\theta = (w_1, \dots, w_d)\) as a collective notation for all objects. To analyze the error induced by compression, we impose a mild regularity assumption on \(f\). Specifically, it is known that any symmetric function admits a “deep set”–style universal representation [4] of the form \[\label{eq:deep95set95rep} f(w_1, \dots, w_d) \;=\; \rho \left( \sum_{i=1}^d g(w_i) \right),\tag{2}\] where \(\rho\) and \(g\) are suitable functions. Importantly, for smooth \(f\), one can choose \(\rho\) and \(g\) to be smooth as well [5]. We assume that \(\rho\) and \(g\) are Taylor-expandable with finite radii of convergence, and that neither \(\rho\) nor \(g\) depends on \(d\). Since our theory concerns compressing the objects while preserving the values of such symmetric functions, this framework is a unified approach to compressing either training data or any permutation-symmetric parameter set of a learning model.
Let \(V = \mathbb{R}^m\) denote the space in which each object \(w_i\) is embedded. \(\otimes\) represents the tensor product. The shorthand \([d]\) refers to the index set \(\{1,2,\dots,d\}\), but when \(x\) is a non-integer real number, \([x]\) denotes the nearest integer. \(S_d\) denotes the symmetric group on \(d\) elements. For a nonnegative weight vector \(\{c_i\}_{i=1}^d\), the support is defined as \(\operatorname{supp}(c) \equiv \{ i \in [d] \mid c_i \neq 0 \}\). For a set \(S \subseteq V\), the diameter is \(\operatorname{diam}(S) \equiv \max_{x,x'\in S} \|x-x'\|\), where we use the Euclidean norm throughout this paper. We write \(\mathcal{N}(\mu, \sigma^2)\) for the normal distribution with mean \(\mu\) and variance \(\sigma^2\). All other notations will be introduced in context.
Model and dataset compression has long been a central problem in AI [6]–[10]. Yet, almost no theoretical framework exists to explain why, or to what extent, such compression is possible. A primary conceptual framework is the lottery ticket hypothesis (LTH) [7], which posits that within every network there exists a small subnetwork that, when retrained, can achieve the same performance as the original. Several theoretical works have established variants of the LTH [11], [12]. However, these results typically fail to imply that the compressed model exhibits the same learning dynamics as the original—that is, that it reaches the same performance after training—which is arguably the most practical implication of the LTH. To date, the original formulation of the LTH remains unproven, precisely because of its dual requirement of both training and compression. We provide a more detailed discussion of this point in Sec. 5. Another closely related work is [13], which suggests the connection between symmetries and emergent sparsity during training but does not study active compression.
A major empirical guideline for training large language models (LLMs) is the neural scaling laws, which state that as the size of models and datasets increases, the generalization error decays as a power law: \(L \propto d^{-\alpha}\), with \(\alpha\) often small [1]. Such small exponents pose a central obstacle for scaling LLMs. For example, when \(\alpha = 0.1\), reducing the generalization error by half would require increasing the dataset size by a factor of \(1000\)—an impractical demand given today’s limited data availability. [8] suggests the possibility of improving scaling laws through data pruning; however, their theory applies only to linear regression and assumes knowledge of the ground-truth model. Whether scaling laws can be improved in more general settings, and without requiring access to the ground truth, remains unknown.
Compression is enabled by the observation that symmetric functions can be characterized by far fewer degrees of freedom than their apparent dimensions, which follows from a variant of the fundamental theorem of symmetric polynomials (FTSP). We then leverage this result to show that a family of compression algorithms ensures asymptotically lossless compression.
The value of a symmetric function does not depend on the specific ordering of \(\theta = (w_1, \dots, w_d)\), where each \(w_i \in \mathbb{R}^m\). As a simple example, the joint probability distribution of \(d\) i.i.d. random variables is symmetric in the sampled data points. This i.i.d. property underlies classical Shannon compression [14]. In this sense, permutation symmetry can be viewed as a natural generalization of independence (see, e.g., [15]), and it is thus unsurprising that the compression of i.i.d. variables extends to the more general setting of permutation-symmetric variables. The following theorem shows that it suffices to keep track of the tensorial statistical moments of \(\theta\), an idea traceable to Newton and Lagrange [16], and formalized in the fundamental theorem of symmetric polynomials (FTSP). The classical FTSP, however, applies only in the case \(m=1\), i.e., when each \(w_i \in \mathbb{R}\). The following version of the FTSP directly relates multivariate symmetric polynomials to their moments.
Theorem 1 (FTSP, Multivariate Variant). Let \(\theta = (w_1, \dots, w_d)\) with \(w_i \in \mathbb{R}^m\), and let \(f(\theta)\) be a symmetric polynomial in the scalar components \(w_{i,a}\). Then \(f(\theta)\) can be expressed as a function of the moments \(p_k\), \(k \in [d]\), defined by \[p_k = \frac{1}{d} \sum_i w_i^{\otimes k} \;\equiv\; \frac{1}{d} \sum_i \underbrace{w_i \otimes \dots \otimes w_i}_{k \,\text{repetitions}}.\]
If \(f(\theta)\) is a symmetric polynomial of degree \(k\), then \(f(\theta)\) is fully determined by the first \(k\) moments, and any change to \(\theta\) that preserves these moments leaves \(f(\theta)\) unchanged. In other words, the variables can be compressed down to the dimension of the linear space spanned by their first \(k\) moments. The classical Tchakaloff theorem gives a direct upper bound on the number of elements required for the compression.
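For concreteness, the moment tensors \(p_k\) can be computed in a few lines; a minimal NumPy sketch (the array sizes are illustrative, not from the paper):

```python
import numpy as np

def tensor_moment(W, k):
    """p_k = (1/d) * sum_i w_i^{(x)k} as an m^k tensor, where W has shape (d, m)."""
    d, m = W.shape
    p = np.ones((d,) + (1,) * k)           # broadcasts into shape (d, m, ..., m)
    for axis in range(1, k + 1):
        shape = [d] + [1] * k
        shape[axis] = m
        p = p * W.reshape(shape)           # multiply in one tensor factor of w_i
    return p.sum(axis=0) / d

W = np.random.default_rng(1).normal(size=(100, 3))
p2 = tensor_moment(W, 2)                   # the second moment equals (1/d) W^T W
assert np.allclose(p2, W.T @ W / W.shape[0])
```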
Theorem 2 ([17]). Let \(\mu\) be a measure supported on \(D \subset \mathbb{R}^m\). Then there exist \(N\) points \(w_j \in D\), with \(N \leq N_{m,k} \equiv \binom{m+k}{k}\), and positive weights \(c_j\), such that the first \(k\) moments are matched. That is, for all \(l\in \{0,1,\dots,k\}\), \[\int_D w^{\otimes l}\, d\mu(w) = \sum_{j=1}^N c_j \,{w_j}^{\otimes l}.\]
A proof follows from Carathéodory’s theorem in dimension \(N_{m,k}\) [18]. Note that \(N_{m,k}\) is precisely the dimension of the linear space of all moments up to order \(k\) (including a fictitious \(p_0\)). Algorithm 7 in Appendix 10 guarantees such a compression: whenever there are more than \(N_{m,k}\) weighted objects, it is always possible to reduce the support to at most \(N_{m,k}\) objects while preserving the first \(k\) moments.
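Algorithm 7 itself is deferred to Appendix 10; the following is a minimal NumPy sketch of a Carathéodory-style reduction in the spirit of Theorem 2 (an illustration rather than Algorithm 7 itself; the function names, numerical threshold, and toy sizes are ours):

```python
import itertools
import numpy as np

def monomial_features(W, k):
    """All monomials w^alpha with |alpha| <= k, one row per point.
    The number of columns is N_{m,k} = binom(m+k, k)."""
    d, m = W.shape
    cols = []
    for deg in range(k + 1):
        for alpha in itertools.combinations_with_replacement(range(m), deg):
            col = np.ones(d)
            for a in alpha:
                col = col * W[:, a]
            cols.append(col)
    return np.stack(cols, axis=1)

def caratheodory_reduce(W, c, k):
    """Shrink the support of the nonnegative weights c while preserving all
    moments of order <= k (a sketch of the reduction behind Theorem 2)."""
    c = np.asarray(c, dtype=float).copy()
    Phi = monomial_features(W, k)               # the moments are Phi.T @ c
    while True:
        idx = np.flatnonzero(c > 0)
        if len(idx) <= Phi.shape[1]:            # support already <= N_{m,k}
            return c
        # any null vector of the active moment matrix is a moment-preserving direction
        _, _, Vt = np.linalg.svd(Phi[idx].T, full_matrices=True)
        v = Vt[-1]
        if not np.any(v > 0):
            v = -v
        t = np.min(c[idx][v > 0] / v[v > 0])    # largest step keeping all weights >= 0
        c[idx] = c[idx] - t * v                 # drives at least one weight to zero
        c[np.abs(c) < 1e-12] = 0.0

rng = np.random.default_rng(0)
W = rng.normal(size=(200, 2))                   # 200 objects in m = 2 dimensions
c0 = np.ones(200)                               # the original, uniform weights
c1 = caratheodory_reduce(W, c0, k=3)
Phi = monomial_features(W, 3)
assert np.allclose(Phi.T @ c0, Phi.T @ c1)      # moments up to order 3 are preserved
print((c1 > 0).sum(), "active objects, at most", Phi.shape[1])
```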
In the spirit of Tchakaloff’s theorem, one can compress the original parameter set \(\theta\) into a smaller weighted parameter set \(\theta'\):
Definition 2. A weighted parameter set \(\theta'\) is defined as a collection of weight-parameter pairs \[\theta' = \{ (c_1, w_1), (c_2, w_2), \dots, (c_{d'}, w_{d'}) \} ,\] where each \(c_j\geq 0\) and \(w_j\in V = \mathbb{R}^m\). The moments of \(\theta'\) and the values of symmetric functions are defined as if there exist \(c_j\) copies of \(w_j\): \[p'_k = \frac{1}{\sum_{j=1}^{d'} c_j} \sum_{j=1}^{d'} c_j w_j^{\otimes k} , \quad f(\theta') = \rho\left( \sum_{j=1}^{d'} c_j g(w_j) \right) .\]
We remark that the original \(\theta\) can also be viewed as weighted, with unit weight for each object. A feature of our compression is that it does not change any of the \(w_i\)’s but instead adjusts the weights so that they are supported on a smaller subset. When \(f\) and \(\{w_i\}_{i\in[d]}\) are fixed, we may regard the output as a function only of the weights \(\{c_i\}_{i\in[d]}\). Specifically, if \(f_d(\theta) = \rho(\sum_i g(w_i))\), we denote \(\phi(c) = \rho\left( \sum_{i=1}^{d} c_i g(w_i) \right)\).
We are now ready to study the approximation of symmetric functions \(f\) that are Taylor-expandable. Expanding \(f\) to order \(k\) yields a symmetric polynomial of degree \(k\), whose moments can be matched with very few weighted objects. Consequently, the approximation error begins at order \((k+1)\).
Theorem 3 (Moment matching in a small ball). Let \(f\) be a symmetric function acting on a weighted parameter set \(\theta = \{(c_i, w_i)\}_{i\in [d]}\), and let \(\phi\) be its corresponding function acting on weights. Let \(r = \max_{i\neq j} \|w_i-w_j\|\). Then, there exists a \(\theta' = \{(c'_i, w_i)\}_{i\in [d]}\) such that no more than \(N_{m,k} \equiv \binom{m+k}{k}\) weights are nonzero, and \[\left| \phi(c') - \phi(c) \right| = O(dr^{k+1}).\]
Theorem 3 is a main ingredient of our compression theory. A lesson from this theorem is that working with a group of objects of small diameter is very effective: choosing a large enough \(k\) suppresses the compression error to an arbitrarily high power of the diameter.
Theorem 3 shows that compressing points within a small diameter enables tight control of the error. By a sphere-packing argument (see Lemma 1), if we place many (\(d\)) objects in a bounded region, we can always find a subset of \(N\) objects with diameter \(O((N/d)^{1/m})\). This gives rise to the general compression strategy of Algorithm 2, and any concrete algorithm that fits this strategy is guaranteed to have vanishing compression error.
In terms of a practical compression algorithm, we first remark that Step 2 (moment matching) can be realized by Algorithm 7 in Appendix 10, although several other moment-matching algorithms exist, especially for removing many points from one cluster at once [19]. The most time-consuming part of Algorithm 7 is finding null vectors, so we estimate its time complexity as \(O(d N_{m,k}^2)\), which tells us that matching higher moments takes much longer. For clustering, ideally one would greedily find, in each iteration, the \((N_{m,k}+1)\)-subset with the smallest diameter, but finding the smallest such cluster is known to be NP-hard in general. For clarity, in Theorem 4 of this section we prove error bounds for the greedy strategy, showing that a compression map satisfying the asserted error bounds always exists. However, we perform numerical experiments using \(k\)-means clustering (see Appendix 10), which is much faster in practice.
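The overall cluster-then-match loop of this strategy can be sketched as follows, reusing the `caratheodory_reduce` sketch above and substituting \(k\)-means for the greedy clustering, as in our experiments; the cluster-size heuristic, stopping rule, and toy data are illustrative choices rather than the actual implementation:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def compress(W, c, k, target_support):
    """Repeatedly cluster the active objects and match moments inside each cluster
    (a sketch of the general strategy; `caratheodory_reduce` is defined above)."""
    c = np.asarray(c, dtype=float).copy()
    while (c > 0).sum() > target_support:
        before = (c > 0).sum()
        active = np.flatnonzero(c > 0)
        n_clusters = max(1, len(active) // 64)      # illustrative cluster size of ~64
        _, labels = kmeans2(W[active], n_clusters, minit='points')
        for lab in np.unique(labels):
            sub = active[labels == lab]
            c[sub] = caratheodory_reduce(W[sub], c[sub], k)
        if (c > 0).sum() == before:                 # cannot shrink further at this k
            break
    return c

rng = np.random.default_rng(1)
W = rng.normal(size=(2000, 2))
c_small = compress(W, np.ones(2000), k=3, target_support=50)
print((c_small > 0).sum(), "active objects remain")
```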
The following theorem shows that with sufficiently large \(k\) (compared to \(m\), but still much smaller than \(d\)), one can compress the number of active objects to any power of \(d\) with vanishing error.
Theorem 4 (Universal Compression). Let \(\|w_i\|\leq R\) for all \(i\in[d]\). Let \(f\) be a symmetric function and \(\phi\) be the same function viewing weights as variables. We compress \(\theta = \{w_i\}_{i=1}^d\) into \(d'<d\) weighted objects \(\theta'\). By matching up to the \(k\)th moment:
1. the error is \[\label{eq:error_finite_k} \mathcal{E} = \left| \phi(c') - \phi(c) \right| = O\left( d \, (d')^{1-\frac{k+1}{m}} \right);\qquad{(1)}\]
2. \(d\) original objects can be compressed into \[\label{eq:dstop_finite_k} d' = \Omega\left( \left(\frac{d}{\omega(d)}\right)^{\frac{m}{k-m+1}} \right)\qquad{(2)}\] weighted objects, such that the error is no larger than \(\omega(d)\).
An implication of this theorem is that we can remove almost all objects. In fact, we can reach \(d'=d^{\sigma}\) with any \(0<\sigma<1\) with vanishing error. To see this, we insert \(d'=d^{\sigma}\) into the error bound of Theorem 4 and get \[\label{eq:error_dsigma} \left| \phi(c') - \phi(c) \right| = O(d^{1+\sigma(1-\frac{k+1}{m})}).\tag{3}\] The error vanishes by choosing \(k > (1+\sigma^{-1})m - 1\). In Eq. 3, compressing up to order \(k\) is enabled by the fact that the function is at least \(C^k\)-differentiable, whereas the term \(m\) in the exponent is a typical signature of the curse of dimensionality. This competing effect of dimensionality and differentiability is reminiscent of common error scalings in nonparametric statistics [20].
Theorem 4 seems to imply that larger \(k\) always yields tighter error bounds. However, this stops being true when the cluster size \(N_{m,k}\) becomes comparable to \(d\). A more careful analysis shows that the optimal choice of \(k\) is \(k_{\mathrm{opt}} = \Theta(d'^{1/m})\) (Theorem 7). It follows that there exists a moment-matching algorithm whose error is at most stretched-exponentially decaying, as \(\exp(-\mathrm{const}\times \sqrt[m]{d'})\); conversely, we can compress \(d\) original objects into \[\label{eq:opt_cp_rate} d' = O \left(\log^m \frac{d}{\omega(d)} \right)\tag{4}\] weighted objects, such that the error is no larger than \(\omega(d)\). Although we have shown that compressing to \(\operatorname{polylog}(d)\) is possible, it should be noted that when \(k=k_{\mathrm{opt}}\), \(N_{m,k}\) becomes comparable to \(d\), so compression becomes much more computationally expensive.
Figure 3 presents a direct numerical test of the compression error scaling predicted in Theorem 4. We study the following function: \[\label{eq:sigmoid95func} f(w_1, \dots, w_d) = \frac{1}{d} \sum_{i=1}^{d} \frac{1}{10}\sum_{a=1}^{10} \operatorname{sigmoid}(\langle w_i, x_a \rangle),\tag{5}\] where all \(w_i\) and \(x_a\) are \(m\)-dimensional vectors, and \(\langle \cdot, \cdot \rangle\) denotes the inner product. We compress a fixed fraction of the original dataset across different dimensions and with varying moment-matching orders. Specifically, we randomly sample vectors \(x_a\) and inputs \(\{w_i\}_{i\in [d]}\), perform compression, and average results over \(20\) trials to obtain each data point in Fig. 3 (a–d); error bars indicate standard deviation. Figure 3 (e) shows that the error decays approximately as \(\mathcal{E} \sim d^{-(k+1)/m-0.5}\). While Eq. 3 predicts an upper bound \(\mathcal{E} \sim d^{-(k+1)/m+2}\), the observed error lies well below this bound and shows a dependency on \(k\) and \(m\) that is similar to the theoretical bound.
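For reference, Eq. 5 and its weighted evaluation (in the sense of Definition 2) can be written in a few lines of NumPy; the sizes below are illustrative, and a compressed weight vector \(c'\) is assumed to come from the moment matching above rather than being computed here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi(c, W, X):
    """Weighted value of Eq. (5): the compression preserves the total weight,
    so normalizing by c.sum() matches the 1/d factor of the uniform case."""
    s = sigmoid(W @ X.T).mean(axis=1)            # average over the probe vectors x_a
    return float(np.dot(c, s) / c.sum())

def compression_error(c_prime, c, W, X):
    """E = |phi(c') - phi(c)|, with c' assumed to come from moment matching."""
    return abs(phi(c_prime, W, X) - phi(c, W, X))

rng = np.random.default_rng(0)
m, d = 4, 10_000                                  # illustrative sizes
X = rng.normal(size=(10, m))                      # the fixed vectors x_a
W = rng.normal(size=(d, m))                       # the objects w_i
c_uniform = np.ones(d)
print(phi(c_uniform, W, X))                       # the uniform-weight value f(w_1, ..., w_d)
```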
Figure 4 shows the effectiveness of dataset compression in a standard neural network training task. We consider a supervised learning problem in a teacher–student setup. Specifically, the teacher \(f(x_1, x_2)\) is implemented as a random two-layer neural network of width \(50\) with ReLU activation. The student receives \(d\) data points of the form \((x_1, x_2, \mathcal{N}(f(x_1, x_2), 3^2))\), where \(x_1\) and \(x_2\) are uniformly sampled from \([-1,1]\), and the goal is to learn \(f(x_1,x_2)\) using an identical network architecture. The loss function is exactly symmetric in the dataset. For full-batch update rules, which depend on the gradient of the full-batch loss, any model prediction or performance metric is thus a symmetric function of the dataset. Consequently, if the student is given a compressed dataset of size \(d'\) derived from an original dataset of size \(d\), the performance approximates that of training on the full dataset, and typically surpasses that of training on \(d'\) naively subsampled data points (Fig. 4 (a,b)).
In practice, however, stochastic update rules based on mini-batches are more common. In this case, permutation symmetry holds only in an averaged sense. Nonetheless, we find that compression remains useful. Figures 4 (c,d) show the loss evolution with batch size \(256\), supporting this claim. Additional details are provided in Appendix 11.
The lottery ticket hypothesis [7] postulates that within any sufficiently wide neural network, one can find a subnetwork that, when trained in isolation, achieves performance comparable to the original. While widely observed, its theoretical understanding has remained elusive. This section proves a corollary of our compression theory: any layer of a neural network with width \(d\) can be asymptotically compressed, losslessly, to a size polylogarithmic in \(d\) such that the training dynamics before and after compression are identical. We refer to this statement as the dynamical LTH, which can be viewed as a stronger and more quantitative variant of the original hypothesis, which only claims that the final training performance of the compressed network is at the same level as the original.
The key idea behind the dynamical LTH is that predictions and loss functions are not only symmetric functions of the trained parameters, but can also be regarded as symmetric functions of the initial parameters. This holds because common update rules are equivariant, i.e., they commute with permutations (see Appendix 9). Let \(f\) denote a symmetric function, let \(\mathcal{T}\) denote the training dynamics (a mapping from initial parameters to trained parameters), and let \(\theta_0\) denote the initial network parameters. By equivariance, \(f(\mathcal{T}(\theta_0))\) is again a symmetric function of \(\theta_0\). Therefore, assuming that \(f \circ \mathcal{T}\) admits a finite radius of convergence, the moment-matching compression established earlier applies directly to \(f \circ \mathcal{T}\), leading to the dynamical LTH.
To make the dynamical LTH practically applicable, we must define a compressed dynamics \(\mathcal{T}'\), which maps compressed initial parameters to compressed trained parameters. The formal definition is given in Appendix 9, but in practice it is straightforward. For gradient-based update rules such as SGD or Adam, the compressed dynamics acting on \(\theta' = \{(c_j, w_j)\}_{j \in [d']}\) is identical to the original dynamics on \(\theta\), except that each gradient \(\partial L / \partial w_j\) is replaced by \({c_j}^{-1} \partial L / \partial w_j\). With the notions of equivariance and compressed dynamics, one can prove the dynamical LTH:
Theorem 5 (Dynamical lottery ticket hypothesis). Let \(\theta \in V^d\) be a set of permutation-symmetric trainable parameters of a neural network. Suppose the training dynamics \(\mathcal{T}\) is equivariant and the model prediction \(f(\theta)\) is symmetric. Then, for any initial parameter \(\theta_0\), there exists a compressed weighted parameter \(\theta'_0\), supported on \(d' = O (\log^m \tfrac{d}{\omega(d)})\) points, such that \[\big| f(\mathcal{T}'(\theta'_0)) - f(\mathcal{T}(\theta_0)) \big| \;\leq\; \omega(d) .\]
The specific error scaling of the dynamical LTH takes the same form as in Theorems 4 and 7. A subtlety of Theorem 5 is that compression produces a “weighted neural network.” However, such networks can be reduced to equivalent standard networks. To see this, recall the expression for the output in Eq. 1. For a weighted network, this becomes \[f(x) = \sum_{j=1}^{d'} c_j v_j \,\sigma(w_j^T x).\] If we redefine the second-layer weights as \(v_j \leftarrow c_j v_j\), then the expression reduces exactly to the output of a standard neural network (by incorporating \(c\) into the outgoing weight).
Figure 5 directly validates the dynamical LTH. The task is to learn the function \(f(x_1,x_2)\) shown in Fig. 5 (a) from \(10^5\) data points of the form \((x_1, x_2, \mathcal{N}(f(x_1, x_2), 0.2^2))\), with \(x_{1/2}\) sampled uniformly from \([-1,1]\). Across a wide range of update rules and learning rates, the predictions of a wide network and its compressed counterpart are shown to be nearly indistinguishable throughout the training dynamics.
Theorems 4 and 7 show that the predictions obtained from a large number of objects can be faithfully reproduced by a much smaller set of weighted objects. This observation can be leveraged to asymptotically improve neural scaling laws. A commonly used empirical form of neural scaling laws is [21] \[L(d) = L_0 + (d_0/d)^{\alpha},\] where \(L\) denotes the loss and \(d\) may represent dataset size or number of parameters, or similar resources. Our compression theory improves power-law scaling in dataset size or model parameters up to stretched-exponential scaling. To see this, suppose we compress \(d\) objects into \(d' = d^\sigma\) objects, where \(0 < \sigma < 1\); then the difference between the losses of the original and the compressed training is bounded by Eq. 3, which is at most \(O(d^{1+\sigma(1-\tfrac{k+1}{m})})\). Choosing a sufficiently large \(k\), one can always ensure that the compression-induced error vanishes faster than \(O(d^{-\alpha})\), leading to an improved scaling: \[L(d') = L_0 + (d_0'/d')^{\alpha/\sigma}.\] In the same spirit, we can compress \(d\) to \(d'=O(\log^m d)\) (inducing an error that decays as an arbitrarily fast power law, hence negligible) so that any power-law scaling is improved to a stretched-exponential scaling: \[L(d') = L_0 + d^{-\alpha} = L_0 + \exp(-\alpha' \sqrt[m]{d'}) .\]
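For concreteness, substituting \(d = (d')^{1/\sigma}\) into the original law makes the boosted exponent explicit: \[L(d') = L_0 + \left(\frac{d_0}{(d')^{1/\sigma}}\right)^{\alpha} = L_0 + \left(\frac{d_0^{\sigma}}{d'}\right)^{\alpha/\sigma},\] so \(d_0' = d_0^{\sigma}\); for instance, \(\sigma = 1/2\) doubles the scaling exponent.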
We demonstrate improvements in scaling laws with respect to dataset size in Fig. 6 (a) and with respect to network width in Fig. 6 (b). The learning tasks in Figs. 6 (a) and 6 (b) are the same as those in Figs. 4 and 5, respectively; additional details are provided in Appendix 11. In both cases, compressing \(d\) objects to \([16\sqrt{d}]\) objects effectively doubles the scaling exponent. As a practical remark, the error bound in Eq. 3 guarantees a sufficiently small compression error only if \(k \gtrsim 13\). However, in Fig. 6 we match only up to the sixth moment and still observe a quadratic speedup. Together with Fig. 3, this suggests that in practice the error scaling can be substantially faster than the worst-case upper bound.
In this work, we established a rigorous framework showing that any permutation-symmetric function admits strong asymptotic compression. A key advantage of the theory is that it is system-agnostic, meaning that it can be applied to any architecture or loss function. The theory has strong implications for deep learning and AI research. We showed that this result can be applied to understand the compressibility of both neural networks and datasets, a connection that has not been identified previously. This leads to a unified theory of compression for deep learning.
The central contribution of our theory is a proof of concept that it is theoretically possible to strongly compress neural networks and datasets, enabling far more efficient use of training data and model parameters. An important future direction is therefore to develop practical compression algorithms that can improve neural scaling laws at scale. We proposed a polynomial-time algorithm that achieves these scalings exactly, but it is currently too slow and memory-intensive in high dimensions. We expect future work to either optimize this algorithm or design scalable approximations. A potential limitation of moment-matching compression is slow error decay when the constituent dimension \(m\) is large, manifesting the familiar curse of dimensionality. However, many ostensibly high-dimensional objects actually lie near low-dimensional manifolds [22]. In particular, language data appear to have an effective dimension \(\sim 10\) [23]. These observations suggest that our compression framework can be extended to handle high-dimensional yet structured data, where exploiting low-dimensional embeddings may greatly mitigate the apparent limitation.
Our framework also suggests new directions beyond direct compression. In particular, it points to improved data sampling strategies and model initialization schemes. For example, our proof of the dynamical LTH can be interpreted as showing that a sufficiently well-initialized network may match the performance of a randomly initialized network orders of magnitude larger. The challenge is to construct initializations (or datasets) that behave as if they were already compressed. Intuitively, well-chosen objects should be weighted and roughly equidistant, making further compression difficult. Such strategies may be connected to importance sampling [24], [25] and orthogonal initialization [26]. On the theory side, many questions remain open. Extensions of our results could refine approximation rates in Barron space [27], [28]. Finally, while our framework is rooted in the symmetric group, generalizations to other group structures may offer further insights.
The authors thank Yizhou Xu for discussion. ILC acknowledges support in part from the Institute for Artificial Intelligence and Fundamental Interactions (IAIFI) through NSF Grant No. PHY-2019786. This work was also supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
The following theorem is well known. We quote it and will use it in the proof of Theorem 1.
Theorem 6 (Fundamental theorem of symmetric polynomials). For any symmetric polynomial \(f(x_1, \dots, x_d)\), where \(x_i\in\mathbb{R}\) for all \(i\in[d]\), there is a unique polynomial \(P\) such that \[f(x_1, \dots, x_d) = P(p_1, \dots, p_d) ,\] where \(p_k\) (\(k\in [d]\)) is called the \(k\)th moment and is defined as \[p_k = \frac{1}{d}\sum_{i=1}^d x_i^k .\]
An arguably more common statement of the FTSP is that any symmetric function is a function of power sums \(\sum_{i=1}^{d} x_i^k\) for \(k=0,1,\dots, d\), which differs from our statement by a normalization of \(d\). However, our convention fixes each moment \(p_k\) to constant order of magnitude. With our convention, we can treat \(P\) as dependent on \(d\).
Proof of Theorem 1. We write the coordinates of \(w_i\) as \((w_{i,1}, \dots, w_{i,m})\). For convenience, we use the multi-index notation \(\alpha = (a_1, \dots, a_m)\), such that \(w_i^\alpha \equiv w_{i,1}^{a_1} \dots w_{i,m}^{a_m}\). For a multi-index \(\alpha\), \(|\alpha| \equiv a_1 + \dots + a_m\). Also, we define \(q_\alpha = \sum_{i=1}^d w_i^\alpha\) (we use \(q\) to distinguish these unnormalized sums from the \(p\)’s, which carry a \(1/d\) factor). Our proof is divided into a few steps: (1) represent \(f\) in terms of the \(q_\alpha\)’s; (2) repack \(\{q_\alpha\}\) into tensor moments \(\{p_k\}\); (3) show that only \(p_k\) for \(k=1,\dots, d\) are independent; (4) show that the representation in terms of \(\{p_k\}\) is unique.
Generally, a polynomial \(f\) is linearly spanned by monomials in the form \(w_{i_1}^{\alpha_1} w_{i_2}^{\alpha_2} \dots w_{i_N}^{\alpha_N}\) (note that \(N\leq d\)), where all subscripts are distinct and each \(|\alpha_j| >0\). Imposing permutation symmetry \(S_d\), such monomials differing by index permutation must appear with the same coefficient. That is, the space of symmetric polynomials is linearly spanned by terms in the form \[\label{eq:general95sym95term} \sum_{\sigma\in S_d} w_{\sigma(1)}^{\alpha_1} w_{\sigma(2)}^{\alpha_2} \dots w_{\sigma(N)}^{\alpha_N} \propto \sum_{\mathcal{X}(i_1, \dots, i_N)} w_{i_1}^{\alpha_1} w_{i_2}^{\alpha_2} \dots w_{i_N}^{\alpha_N}\tag{6}\] Here, we introduced the notation \(\mathcal{X}(i_1, \dots, i_N)\) to denote the set of index lists \((i_1, \dots, i_N)\) such that all \(i_j\) are distinct and range from \(1\) to \(d\).
We prove by induction on \(N\) that all such terms can be written as polynomials of \(q_\alpha\)’s. When \(N=1\), Eq. 6 is simply \(q_\alpha\) by definition: \[\sum_{i=1}^d w_i^\alpha = q_\alpha .\] Now, suppose Eq. 6 is proven to be a polynomial of \(q_\alpha\)’s. Replacing \(N\to N+1\), we look at \[\label{eq:sym95term951} \sum_{\mathcal{X}(i_1, \dots, i_{N+1})} w_{i_1}^{\alpha_1} w_{i_2}^{\alpha_2} \dots w_{i_{N+1}}^{\alpha_{N+1}} .\tag{7}\] We use the set decomposition \[\begin{align} \mathcal{X}(i_1, \dots, i_{N}) \times \{i_{N+1}\}_{i=1}^d &= \mathcal{X}(i_1, \dots, i_{N+1}) \sqcup \mathcal{X}(i_1, \dots, i_{N}) \times \{i_1\} \\ &\phantom{=} \sqcup \dots \sqcup \mathcal{X}(i_1, \dots, i_{N}) \times \{i_N\} \end{align}\] to rewrite Eq. 7 as \[\begin{align} &\sum_{\mathcal{X}(i_1, \dots, i_{N+1})} w_{i_1}^{\alpha_1} w_{i_2}^{\alpha_2} \dots w_{i_{N+1}}^{\alpha_{N+1}} = \sum_{\substack{\mathcal{X}(i_1, \dots, i_{N}) \\ i_{N+1}}} w_{i_1}^{\alpha_1} w_{i_2}^{\alpha_2} \dots w_{i_{N+1}}^{\alpha_{N+1}} \\ &- \sum_{\mathcal{X}(i_1, \dots, i_{N})} w_{i_1}^{\alpha_1} w_{i_2}^{\alpha_2} \dots w_{i_{N}}^{\alpha_{N}} w_{i_{1}}^{\alpha_{N+1}} - \dots - \sum_{\mathcal{X}(i_1, \dots, i_{N})} w_{i_1}^{\alpha_1} w_{i_2}^{\alpha_2} \dots w_{i_{N}}^{\alpha_{N}} w_{i_{N}}^{\alpha_{N+1}} . \end{align}\] On the right-hand side, the first term can be written as \[\left( \sum_{\mathcal{X}(i_1, \dots, i_{N})} w_{i_1}^{\alpha_1} w_{i_2}^{\alpha_2} \dots w_{i_{N}}^{\alpha_{N}}\right) q_{\alpha_{N+1}} ,\] in which the parenthesis is a polynomial of \(q_\alpha\)’s by induction assumption. The second term can be written as \[\sum_{\mathcal{X}(i_1, \dots, i_{N})} w_{i_1}^{\alpha_1+\alpha_{N+1}} w_{i_2}^{\alpha_2} \dots w_{i_{N}}^{\alpha_{N}} ,\] which is a polynomial of \(q_\alpha\)’s as well, and so are the other terms in the ellipse. Therefore, we have proven that all terms in the form of Eq. 6 can be written as polynomials of \(q_\alpha\)’s.
Next, we repack \(q_{\alpha}\) into tensors. Define \(p_\alpha = q_\alpha/d\). Note that the set \(\{ p_{\alpha} \;| \;|\alpha|=k \}\) is exactly the set of coordinates of the tensor \(p_k = \sum_i w_i^{\otimes k}/d\). So indeed, the value of \(f\) relies only on the tensor moments \(p_k\).
To show that \(f\) only requires the first \(d\) moments, we try to represent any \(p_k\) (\(k>d\)) as a function of \(p_1,\dots, p_d\). \(p_k\) is a symmetric tensor of rank \(k\). It is easy to see that the space of symmetric \(k\)-tensors is isomorphic to the space of homogeneous polynomials of degree \(k\) (the argument is denoted as \(u\in\mathbb{R}^m\)), by the homomorphism \[T_{i_1,\dots, i_k} \to \sum_{i_1, \dots, i_k} T_{i_1,\dots, i_k} u_{i_1} \dots u_{i_k} .\] Specifically, \(p_k\) is mapped to \[s_k(u) = \frac{1}{d} \sum_{i=1}^d (u^\top w_i)^k .\] Using Theorem 6 for the variables \(\{u^\top w_i \;| \;i=1,2,\dots,d \}\), there is a polynomial \(P\) such that \[s_k (u) = P(s_1(u), \dots, s_d(u)) .\] Because this holds for arbitrary \(u\), it follows that \(p_k\) is a function of \(p_1, \dots, p_d\) as well.
Finally, we show that the expression of \(f\) in terms of \(p_1, \dots, p_d\) is unique. Assume the contrary: \(f(\theta) = P(\{p_k\}) = P'(\{p_k\})\), where \(P\) and \(P'\) are unequal polynomials. This implies that \(P(\{p_k\}) - P'(\{p_k\}) = 0\), i.e., there is a polynomial relation among \(p_1, \dots, p_d\). However, there is no such relation, even for \(m=1\), as guaranteed by Theorem 6. ◻
Proof of Theorem 3. Choose an \(m\)-dimensional ball of diameter \(r\) that contains all \(w_i\)’s, and let the center be \(w_0\in \mathbb{R}^m\). We first Taylor-expand \(f(\theta)\) around \(\theta_0=(w_0,\dots, w_0)\) up to the \(k\)th order. The expansion is a symmetric polynomial of degree \(k\), so it can be written as a function of \((p_1, \dots, p_k)\). By Theorem 2, we can find another set of weights \(\{c'_i\}\) supported on at most \(N_{m,k} = \binom{m+k}{k}\) points, such that \[\sum_i c'_i (w_i-w_0)^{\otimes l} = \sum_i c_i (w_i-w_0)^{\otimes l} = p_l \quad \textrm{for }l=1,\dots, k.\] Let \(\theta'=\{c'_i, w_i\}\), and denote its moments by \(\{p'_l\}\). By construction, we have \(p'_l=p_l\) for \(l=1,\dots, k\). Since the first \(k\) moments all match, there is no difference between \(\phi(c)\) and \(\phi(c')\) up to the \(k\)th order.
Next, we look at the \((k+1)\)th order in the Taylor expansion of \(f(\theta)\) around \(\theta_0\). It is written as \[\sum_{|\alpha|=k+1} \frac{1}{\alpha!} \partial_\alpha f(\theta_0) (\theta-\theta_0)^\alpha\] It is a degree-\((k+1)\) homogeneous symmetric polynomial, and we denote it as \(P_{k+1} (p_1, \dots, p_{k+1})\). Hence, \[f_d(\theta) - f_d(\theta') = P_{k+1} (p_1, \dots, p_{k+1}) - P_{k+1} (p_1, \dots, p_k, p'_{k+1}) .\] In \(P_{k+1} (p_1, \dots, p_{k+1})\), the only term that depends on \(p_{k+1}\) is linear in itself—\(\langle b_{k+1}, p_{k+1}\rangle\), where \(b_{k+1}\) is a \((k+1)\)-index symmetric coefficient tensor and here \(\langle \cdot, \cdot\rangle\) denotes tensor contraction; all other terms are completely determined by \(\{p_l\}_{l=1}^k\). To see that \(b_{k+1}\) is at most of order \(d\), we use the deep–set representation: \(f(\theta) = \rho ( \sum_{i\in[d]} g(w_i) )\). Then \(b_{k+1}\) can be written down explicitly as \[\begin{align} \langle b_{k+1}, p_{k+1}\rangle &= \frac{\partial f}{\partial y} \sum_{i=1}^d \left\langle a_{k+1}, (w_i-w_0)^{\otimes (k+1)} \right\rangle \\ &= d \left\langle \frac{\partial f}{\partial y} a_{k+1} , p_{k+1} \right\rangle , \end{align}\] where \(y=\sum_{i\in[d]} g(w_i)\), and \(\langle a_{k+1}, (w_i-w_0)^{\otimes (k+1)} \rangle\) is the \((k+1)\)th order in the Taylor expansion of \(g\) around \(w_0\). \(\partial f/\partial y\) is evaluated at \(\theta=\theta_0\). By our convention that \(\rho\) and \(g\) are independent of \(d\), we thus have \(b_{k+1} = O(d)\).
Also, since each \(\|w_j-w_0\| \leq r/2\) we have \(p_{k+1} = O(r^{k+1})\). Therefore, we conclude that \(f_d(\theta) - f_d(\theta') = O(dr^{k+1})\). ◻
This lemma is used in proving Theorems 4 and 7.
Lemma 1 (Sphere packing). Given \(d\gg 1\) objects in a closed \(m\)-dimensional ball of radius \(R\): that is, \(\theta=\{w_i\}_{i=1}^d\), \(\|w_i\|\leq R\). For any \(\theta\), there exists a subset of \(N\) objects whose diameter is at most of order \(R\,(N/d)^{1/m}\). That is, \[\sup_{\theta} \min_{\substack{S\subset\theta\\|S|=N}} \operatorname{diam}(S) = R \,O\left( (N/d)^{1/m} \right) .\]
Proof. We use \(B(x,r)\) to denote a closed \(m\)-dimensional ball centered at \(x\).
Cover \(B(0,R)\) by \(M\) balls of radius \(r\). By the pigeonhole principle, if \(d>(N-1)M\), some ball contains at least \(N\) points, giving an \(N\)-point subset of diameter \(\leq 2r\).
We show that there exists a covering with \(M=(1+2R/r)^m\) balls of radius \(r\). We choose a set of points \(\{x_j\}_{j=1}^M \subset B(0,R)\) to form a maximal \(r/2\)-packing of the ball \(B(0,R)\), meaning \[\|x_i-x_j\|\geq r, \forall i\neq j,\] and no further points can be added while maintaining this separation. By this definition, the balls \(\{B(x_j,r)\}\) form a covering of \(B(0,R)\), but the smaller balls \(\{B(x_j,r/2)\}\) are mutually disjoint and contained in \(B(0,R+r/2)\). By comparing volumes, we find \[M(r/2)^m \leq (R+r/2)^m.\] Hence, if the radius of each ball is \(r\), there exists a covering of \(B(0,R)\) with \(\lceil (1+2R/r)^m \rceil\) balls. Also requiring \(d/(N-1) > \lceil (1+2R/r)^m \rceil\), we conclude that \[\sup_{\theta} \min_{\substack{S\subset\theta\\|S|=N}} \operatorname{diam}(S) \leq \frac{4R}{\left(\frac{d}{N-1}-1\right)^{1/m}-1} = R\,O((N/d)^{1/m}) .\] ◻
Proof of Theorem 4. Denote \(N_{m,k} = \binom{m+k}{k}\). In this proof, we show that the general algorithm (Algorithm 2) with the greedy clustering strategy satisfies the asserted error bounds. By the greedy strategy (also described in Sec. 4.2 and Appendix 10), we mean that in each iteration we choose a subset \(S\subseteq \operatorname{supp}(c)\) of \(N_{m,k}+1\) objects among the active (i.e., nonzero-weight) objects. Then we remove one object in \(S\) (setting its weight to zero) while matching the moments up to order \(k\).
By Theorem 3, the output error of one step is of order \(c_S r^{k+1}\), where \(c_S\) is the total weight of this cluster, and \(r\) is the diameter. Because the greedy strategy only optimizes the diameter, after \(O(d)\) steps there is no better upper bound for \(c_S\) than \(d\). By Lemma 1, when there are \(N\) active objects, the smallest diameter is upper-bounded by \[O \left(\left( \frac{N_{m,k}+1}{N} \right)^{1/m} \right) = O(N^{-1/m}) .\]
Now, we sum up the error of all the steps. Denote the function’s error as \(\mathcal{E} = \left| \phi(c') - \phi(c) \right|\). Then, in one step of pruning, the error is \[\Delta \mathcal{E} = O(dr^{k+1}) = O (d N^{-(k+1)/m}) .\] We upper-bound the sum over \(N\) by an integral using \(\sum_{N=d'+1}^{d} N^{-\alpha} \leq \int_{d'}^{d} N^{-\alpha}\, dN\), and find that pruning \(d\) original objects into \(d'\) objects results in \[\label{eq:intrgral95error} \begin{align} \mathcal{E} &= O\left( \int_{d'}^d d\, N^{-(k+1)/m} \, dN \right) \\ &= O\left( \frac{d}{\frac{k+1}{m}-1} \left( \left(d'\right)^{1-\frac{k+1}{m}} - d^{1-\frac{k+1}{m}} \right) \right) \end{align}\tag{8}\] or logarithmic if \(k+1=m\); but as we are only concerned with vanishing error, we will skip the analysis of \(k+1=m\). Since \(d'<d\), we conclude that the error is upper-bounded by \[\mathcal{E} = O\left( d \, (d')^{1-\frac{k+1}{m}} \right) ,\] which completes the proof of statement 1.
To show statement 2, we look at the inequality \[\omega(d) \geq \mathcal{E} = O\left( d \, (d')^{1-\frac{k+1}{m}} \right) .\] Solving it with respect to \(d'\) gives the asserted scaling of \(d'\). ◻
In establishing Theorem 4, we have been fixing \(k\) as a hyperparameter which can be chosen at will. Here, we will optimize over the choice of \(k\) to study the smallest achievable final size \(d'\) such that the error is vanishing. In the derivation below, \(k\) could scale up with \(d\), but we always set \(m\) to be a finite constant.
Theorem 7. Assume \(\|w_i\|\leq R\) for all \(i\in[d]\). Let \(f\) be a symmetric function, and let \(\phi\) be the corresponding function with respect to weights.
1. There exists a mapping from \(d\) uniformly weighted objects into \(d'\) weighted objects, inducing error \[\mathcal{E} = \left| f_d(\theta') - f_d(\theta) \right| = O\left( d (d')^{1-1/m} \exp\left[ -\frac{1}{e}(m!\rho d')^{1/m} \right] \right) ,\] where \(\rho\) is the radius of convergence of the function \(g\) in the deep–set representation of \(f\) (Eq. 2).
2. \(d\) uniformly weighted objects can be compressed into \[d' = \Omega\left( \log\frac{d}{\omega(d)} \right)^m\] weighted objects, such that the error is no larger than \(\omega(d)\).
Proof. Essentially, we follow the same greedy strategy and the same reasoning as in the proof of Theorem 4. However, since \(k\) can be large, here we keep track of all factors that scale with \(k\) or \(d\).
Recall that the upper bound for the radius of a cluster is \(((N_{m,k}+1)/N)^{1/m}\), and the upper bound for \(c_S\) is \(d\). Also, since the convergence radius of \(g\) is \(\rho\), the constant factor for the \((k+1)\)th order Taylor expansion scales as \(\rho^{-k}\). Putting these together, the error of one step of pruning is upper-bounded by \[\label{eq:deltaE95full} \Delta \mathcal{E} = O\left( d \rho^{-k} \left( \frac{N_{m,k}+1}{N} \right)^{\frac{k+1}{m}} \right)\tag{9}\] For notation simplicity, we will drop the big-\(O\) notation and use \(\Delta\mathcal{E}\) to refer to the upper-bound expression on the right-hand side of Eq. 9 . There is an optimal \(k_{\rm opt}\) that minimizes the single-step error. We solve it by taking derivative of \(\log (\Delta\mathcal{E}/d)\). \[\label{eq:logDeltaE} \begin{align} \log(\Delta\mathcal{E}/d) &= -k\log\rho + \frac{k+1}{m} \left( \log (N_{m,k}+1) - \log N \right)\\ &= -k\log\rho + \frac{k+1}{m} \left( \log(m+k)! - \log k! - \log(m!N) \right) + O(1) \\ &= k\log k + \log k - \frac{k+1}{m}\log(m!N\rho) + O(1) . \end{align}\tag{10}\] In the third line, we used Stirling’s formula \(\log(n!) = n\log n - n + \frac{1}{2}\log (2\pi n) + \frac{1}{12n} + O(n^{-3})\) when \(n\to\infty\) and simplified the expression. The derivative of the above reads \[\frac{d}{dk} \log(\Delta\mathcal{E}/d) = \log k+1-\frac{1}{m}\log(m!N\rho) + O(k^{-1}) .\] Setting the derivative to zero, we obtain \[\log k_{\rm opt} = \frac{1}{m} \log (m! N \rho) -1 + O(k^{-1})\] Plugging this value into Eq. 10 , we get \[\begin{align} \log(\Delta\mathcal{E}_{\mathrm{opt}}/d) &= -k_{\mathrm{opt}} + O(1) \\ \Rightarrow\quad \Delta \mathcal{E}_{\rm opt} &= O\left( d \exp\left[ -\frac{1}{e} (m!N\rho)^{1/m} \right] \right) . \end{align}\]
Moving forward, we integrate over \(\Delta \mathcal{E}_{\rm opt}\) from \(d'\) to \(d\): \[\label{eq:integral95error95optk} \begin{align} \mathcal{E}_{\rm opt} &= O\left( d \int_{d'}^d \exp\left[ -\frac{1}{e} (m!N\rho)^{1/m} \right] dN \right) \\ &= O\left( d\, \Gamma\left( m, \frac{1}{e} (m!\rho d')^{1/m} \right) \right) , \end{align}\tag{11}\] where \(\Gamma\) is the incomplete Gamma function: \[\Gamma(m,z) \equiv \int_z^\infty t^{m-1}e^{-t} dt .\] Its asymptotic behavior at \(z\to\infty\) reads \[\Gamma(m,z) = \exp\left[ -z+O(z^{-1}) \right] z^{m-1} \left( 1 + O(z^{-1}) \right) .\] Inserting this expansion into Eq. 11 , we get the asserted error scaling in part 1 of the theorem.
To obtain part 2, we solve the inequality \[\omega(d) \geq \mathcal{E} = O\left( d (d')^{1-1/m} \exp\left[ -\frac{1}{e}(m!\rho d')^{1/m} \right] \right)\] with respect to \(d'\). One can take log on both sides of this inequality and safely neglect the factor \((d')^{1-1/m}\) to eventually get \[d' = \Omega\left( \log\frac{d}{\omega(d)} \right)^m .\] ◻
In this Appendix, we formulate all ideas mentioned in Sec. 5 with mathematical rigor.
Let \(S_d\) be the permutation group of \(d\) elements. Let \(V = \mathbb{R}^m\). We define the representation \(R:S_d\mapsto \operatorname{End}(V^d)\) as \[\label{eq:Sd_rep} R(\sigma)(w_1, \dots, w_d) = (w_{\sigma(1)}, \dots, w_{\sigma(d)}) .\tag{12}\] Note that the definition of \(f:V^d\mapsto \mathbb{R}\) being symmetric is equivalent to: for any \(\sigma\in S_d\), \(f\circ R(\sigma) = f\).
Definition 3 (Equivariant map). A function \(\mathcal{T}:V^d\mapsto V^d\) is called an equivariant map if for any \(\sigma\in S_d\), \[\mathcal{T} \circ R(\sigma) = R(\sigma) \circ \mathcal{T} .\]
The dynamics induced by equivariant maps has been studied in conventional settings of dynamical systems [29] but has not received much attention in deep learning. In fact, almost all update rules that we commonly use are equivariant, which we will verify for SGD and Adam later in this Appendix. Because compositions of equivariant maps are also equivariant, it follows that the entire training dynamics (i.e., the mapping from initial model parameters to trained parameters) is equivariant.
The following lemma shows that the composition of a symmetric function with an equivariant mapping is also a symmetric function.
Lemma 2. If \(f: V^d\mapsto \mathbb{R}\) is a symmetric function and \(\mathcal{T}: V^d\mapsto V^d\) is an equivariant map, then \(f\circ\mathcal{T}\) is a symmetric function.
Proof. For any \(\sigma\in S_d\), \[(f\circ\mathcal{T})\circ R(\sigma) = (f\circ R(\sigma)) \circ \mathcal{T} = f\circ\mathcal{T} .\] ◻
Thanks to this lemma, we can treat \(f \circ \mathcal{T}\) as a single symmetric function, and thus we expect that compressing the weights by our moment matching method induces a vanishing error in the value of \(f \circ \mathcal{T}\), that is, literally any prediction of the trained model.
The following lemma establishes a general representation of equivariant maps in terms of moments, so it can be viewed as an extension of the FTSP.
Lemma 3. \(\mathcal{T}\) is an equivariant map if and only if for all \(i\in[d]\), \[\label{eq:ev95fundamental95rep} \mathcal{T}_i (\theta) = T(w_i, \boldsymbol{p}) ,\qquad{(3)}\] where \(\boldsymbol{p}\) is a collective notation for \(\{p_k = \frac{1}{d}\sum w_i^{\otimes k}\}_{k=1}^{d-1}\). Note that \(T\) is a function that does not depend on \(i\), and is uniquely determined by \(\mathcal{T}\).
Proof. First, we verify that for any function \(T\), \(\mathcal{T}_i (\theta) = T(w_i, \boldsymbol{p})\) defines an equivariant map \(\mathcal{T}\). Note that \(\sigma(\boldsymbol{p}) = \boldsymbol{p}\). Following the definition of equivariance, \[\mathcal{T}_i \circ R(\sigma) (\theta) = \mathcal{T}_i (w_{\sigma(1)}, \dots, w_{\sigma(d)}) = T(w_{\sigma(i)}, \boldsymbol{p}) = \mathcal{T}_{\sigma(i)} (\theta) .\] Thus, \(\mathcal{T}\circ R(\sigma) = R(\sigma) \circ \mathcal{T}\).
The rest of this proof is to show that all equivariant maps can be written in the asserted form. For a fixed \(i\), let \(\hat{\sigma}\in S_d\) be any permutation group element such that \(\hat{\sigma}(i) = i\). Consider the \(i\)th component of the equation \(\mathcal{T} \circ R(\hat{\sigma}) (\theta) = R(\hat{\sigma})\circ \mathcal{T}(\theta)\): \[\mathcal{T}_i\circ R(\hat{\sigma})(\theta) = \mathcal{T}_i(\theta) .\] This means that \(\mathcal{T}_i\) is invariant under any permutation of the indices \([d]\backslash \{i\}\), which forms a representation of \(S_{d-1}\). By Theorem 1, \(\mathcal{T}_i\) can be uniquely written as a function of \(w_i\) and power sums \(\sum_{j\neq i} w_j^{\otimes k}\) for \(k\in 0,1,\dots, d-1\). Since \(\sum_{j\neq i} w_j^{\otimes k}\) is uniquely determined by \(w_i\) and \(p_k\) as \(dp_k-w_i^{\otimes k}\), we conclude that there is a unique function \(T_i\) such that \[\mathcal{T}_i(\theta) = T_i (w_i, \boldsymbol{p}).\]
The final task is to show that \(T_i\) does not depend on \(i\). We use \(\pi_{i,j}\) to denote the permutation group element that only exchanges \(i\) and \(j\). The \(i\)th component of the equation \(\mathcal{T}\circ R(\pi_{i,j})(\theta) = R(\pi_{i,j}) \circ \mathcal{T}(\theta)\) reads \[\label{eq:equivariant95rep} T_i(w_j, \boldsymbol{p}) = T_j(w_j, \boldsymbol{p}) ,\tag{13}\] which completes the proof. ◻
We use Lemma 3 to unambiguously define the compressed training dynamics:
Definition 4 (Compressed dynamics). Suppose a training dynamics is determined by an equivariant map \(\theta = \mathcal{T}(\theta_0)\) (arbitrarily initialized weights to trained weights), which can be written as \(\mathcal{T}_i(\theta) = T(w_i, \boldsymbol{p})\) by Lemma 3. For the weighted parameters \(\theta' = \{(c_j, w_j)\}_{j\in [d]}\) (\(c_j\) never changes with the dynamics; some \(c_j\) might be zero so that they are actually pruned), we define the compressed dynamics \(\mathcal{T}'\) as \[\label{eq:compressed_dynamics} \mathcal{T}'_j (\theta') = T(w_j, \boldsymbol{p}') ,\qquad{(4)}\] where \(\boldsymbol{p}'\) is a collective notation for \(\{p'_k = \frac{1}{d}\sum_j c_j w_j^{\otimes k}\}_{k=1}^{d}\). Note that \(\mathcal{T}'\) is uniquely determined by \(\mathcal{T}\).
The mapping from original learning dynamics to compressed ones is in fact easy in practice. Below are some common examples.
Stochastic gradient descent (SGD). Consider a two-layer neural network used in supervised learning. For simplicity, we write the output as \(y = \sum_{i=1}^d g(w_i, x)\), which is symmetric in \(\theta=(w_1, \dots, w_d)\). Each time we choose a batch of training data \(\{(x_a, y_a)\}_{a\in \mathcal{B}}\). We denote the per-sample loss function as \(\ell_a = \ell (y(\theta, x_a), y_a)\). The SGD update rule is \[\label{eq:SGD95update} \begin{align} \mathcal{T}_i(\theta) &= w_i - \eta \mathop{\mathbb{E}}_{a\in\mathcal{B}}\frac{\partial \ell_a}{\partial w_i} \\ &= w_i - \eta \mathop{\mathbb{E}}_{a\in\mathcal{B}}\frac{\partial \ell_a}{\partial y} \frac{\partial g(w_i, x_a)}{\partial w_i} \end{align}\tag{14}\] where \(\eta\) is the learning rate. Note that \(\partial\ell_a / \partial y\) is a function of \(y\), which is thus permutation invariant in \(\theta\). We can explicitly compute \((\mathcal{T}\circ R(\sigma))(\theta)\) and \((R(\sigma) \circ \mathcal{T})(\theta)\), which are both equal to \[w_{\sigma(i)} - \eta \mathop{\mathbb{E}}_{a\in\mathcal{B}}\frac{\partial \ell_a}{\partial y} \frac{\partial g(w_{\sigma(i)}, x_a)}{\partial w_{\sigma(i)}} ,\] so SGD is indeed equivariant.
Then, we derive the compressed SGD. Remember that the weighted neurons compute the output as \(y = \sum_{j=1}^{d'} c_j g(w_j, x)\). Eq. 14 can indeed be written in the form \(T(w_i, \boldsymbol{p})\) because \(\partial\ell_a / \partial y\) is a function of \(\boldsymbol{p}\), and \(\partial g(w_i, x_a)/\partial w_i\) solely depends on \(w_i\). Therefore, the compressed update rule still looks like \[\mathcal{T}'_j(\theta) = w_j - \eta \mathop{\mathbb{E}}_{a\in\mathcal{B}}\frac{\partial \ell_a}{\partial y} \frac{\partial g(w_j, x_a)}{\partial w_j} .\] However, we emphasize that it is not \(w_j - \eta \mathop{\mathbb{E}}_{a\in\mathcal{B}} \partial \ell_a/\partial w_j\), because \[\frac{\partial \ell_a}{\partial w_j} = \frac{\partial \ell_a}{\partial y} c_j \frac{\partial g(w_j, x_a)}{\partial w_j} .\] Effectively, whenever there is a gradient \(\partial L/\partial w_j\) in the original update, we should replace it by \(c_j^{-1} \partial L/\partial w_j\). This rule applies to all other gradient-based updates as well.
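As an illustration, a minimal PyTorch sketch of a single compressed-SGD step (the width, batch size, learning rate, and sigmoid activation are arbitrary choices; the neuron weights \((w_j, v_j)\) play the role of the permutation-symmetric objects, and \(c_j\) are the fixed compression weights):

```python
import torch

torch.manual_seed(0)
d_prime, m, lr = 8, 3, 0.1
w = torch.randn(d_prime, m, requires_grad=True)   # incoming weights of the kept neurons
v = torch.randn(d_prime, requires_grad=True)      # outgoing weights
c = torch.rand(d_prime) + 0.5                     # compression weights, never trained

def model(x):
    # weighted network: y = sum_j c_j * v_j * sigma(w_j^T x)
    return (torch.sigmoid(x @ w.T) * v * c).sum(dim=-1)

x, y_target = torch.randn(32, m), torch.randn(32)
loss = ((model(x) - y_target) ** 2).mean()
loss.backward()
with torch.no_grad():
    # compressed SGD: each gradient is rescaled by 1/c_j before the step
    w -= lr * w.grad / c[:, None]
    v -= lr * v.grad / c
    w.grad = None
    v.grad = None
```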
Finally, we remark on non-deterministic updates. It seems that choosing mini-batches turns the update into a stochastic process, which complicates the problem. But in fact, for any fixed trajectory of mini-batch choices, the update is explicitly equivariant (choosing a mini-batch breaks the permutation symmetry among the training dataset, but not the permutation symmetry of neurons). Therefore, we always expect to see good agreement between the original and compressed dynamics if we use the same choice of mini-batches for both.
Adam. The Adam update rule for standard (i.e., uniformly weighted) parameters reads \[\begin{align} g_t &\leftarrow \mathop{\mathbb{E}}_{a\in\mathcal{B}}\frac{\partial\ell_a}{\partial \theta} (\theta_{t-1}) \\ m_t &\leftarrow \beta_1 m_{t-1} + (1-\beta_1) g_t \\ v_t &\leftarrow \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\ \hat{m}_t &\leftarrow \frac{m_t}{1-\beta_1^t} \\ \hat{v}_t &\leftarrow \frac{v_t}{1-\beta_2^t} \\ \theta_t &\leftarrow \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} , \end{align}\] where \(t\) denotes the time step. To check that Adam is equivariant, we only need to note that the gradient (\(g_t\)) for \(w_i\) takes the form \[\label{eq:explicit_grad} \mathop{\mathbb{E}}_{a\in\mathcal{B}}\frac{\partial \ell_a}{\partial y} \frac{\partial g(w_i, x_a)}{\partial w_i} ,\tag{15}\] where \(\partial\ell_a/\partial y\) is symmetric, and \(\partial g(w_i, x_a)/\partial w_i\) is solely a function of \(w_i\). Using this fact, it is straightforward to check that \((\mathcal{T}\circ R(\sigma))(\theta)\) and \((R(\sigma) \circ \mathcal{T})(\theta)\) are identical.
To define the compressed version of Adam, we need to keep the gradient exactly as in Eq. 15 , similar to our previous discussion on SGD. This in turn tells us that we just need to replace \(\partial L/\partial w_j\) by \(c_j^{-1} \partial L/\partial w_j\). Special to Adam, because \(1/c_j\) appears in both \(\hat{m}_t\) and \(\sqrt{\hat{v}_t}\), the compressed update rule is basically the same as the original if we neglect the small \(\epsilon\). Indeed, in the numerical simulations we conducted with AdamW (the reasoning is the same as Adam), we did not observe any visible difference whether or not to scale the gradients by \(1/c_j\).
Finally, we are ready to prove the dynamical LTH, which we restate here for convenience.
Theorem 5 (restated; Dynamical lottery ticket hypothesis). Let \(\theta \in V^d\) denote a set of permutation-symmetric trainable parameters of a neural network. Suppose the training dynamics \(\mathcal{T}\) is equivariant and the model prediction \(f(\theta)\) is symmetric. Then, for any initial parameter \(\theta_0\), there exists a compressed weighted parameter \(\theta'_0\), supported on \(d' = O\!\left(\log^m \tfrac{d}{\omega(d)}\right)\) points, such that \[\big| f(\mathcal{T}'(\theta'_0)) - f(\mathcal{T}(\theta_0)) \big| \;\leq\; \omega(d) .\]
Proof. We compress \(\theta_0\) using moment matching. By construction, for all \(l = 0,1,\dots, k\) we have \[\sum_{i=1}^d (w_0)_i^{\otimes l} = \sum_{j\in S} c_j (w_0)_j^{\otimes l} .\] For any \(f\), \(f\circ \mathcal{T}\) and \(f\circ\mathcal{T}'\) can both be written as a function of moments. In this representation, using the definition of the compressed dynamics in Definition 4, one can check that they are exactly the same function. The difference is that \(f\circ\mathcal{T}\) takes in \((p_0)_k = \frac{1}{d} \sum_i (w_0)_i^{\otimes k}\), while \(f\circ\mathcal{T}'\) takes in \((p'_0)_k = \frac{1}{d} \sum_j c_j (w_0)_j^{\otimes k}\); they are identical in the first \(k\) moments by construction. Using Theorem 4, we conclude that, for large enough \(k\), the difference between \(f(\mathcal{T}(\theta_0))\) and \(f(\mathcal{T}'(\theta'_0))\) can always be made vanishing, with the same error upper bound as asserted in Theorem 4. Ultimately, using the optimal \(k_{\mathrm{opt}}\) given in Theorem 7, we achieve \(d' = O\!\left(\log^m \tfrac{d}{\omega(d)}\right)\) with error at most \(\omega(d)\). ◻
Here, we describe the concrete algorithm for compression that is used in all our numerical simulations.
Recall that in Algorithm 2 we identified two main steps of the algorithm: (1) clustering and (2) moment matching. We first describe our moment matching strategy. For moment-matching, we consistently use Algorithm 7, in which \(\operatorname{vect}(\cdot)\) represents flattening all the moments into a vector. We remark that the existence of Algorithm 7 is effectively a constructive proof of Tchakaloff’s Theorem 2.
Our clustering strategy is slightly more involved, because finding the smallest cluster in \(d\gg 1\) points is known to be NP-hard. We implement a coarser \(k\)-means clustering instead.
We switch to a greedy strategy only when \(|\operatorname{supp}(c)|\) becomes close to the desired stopping size \(d'\), since the diameters of clusters are likely to be large when \(|\operatorname{supp}(c)|\) is small. Concretely, in a \(k\)-means round, we divide \(\theta\) into \(\propto |\operatorname{supp}(c)|/N_{m,k}\) clusters. Then we apply Algorithm 7 to each cluster containing more than \(N_{m,k}\) objects in parallel. In a greedy round, we find the approximately smallest cluster of \(N_{m,k}+1\) points, which is implemented using the \(k\)-nearest-neighbor algorithm provided by the faiss package [30].
Finally, we remind readers that our error bounds, Theorems 4 and 7, are proven for the greedy strategy, where in each round one finds the smallest cluster of \(N_{m,k}+1\) objects and removes only one of them. For the above-mentioned hybrid clustering strategy, there is no theoretical guarantee as strong as Theorem 4 for the error, but all numerical simulations turn out to meet our expectations.
We studied two main numerical tasks in Figs. 4, 5 and 6. For all these experiments, the loss function in training is the mean squared error (MSE) loss. The test loss shown in the figures is the MSE loss evaluated on a randomly sampled dataset of the form \((x_1, x_2, f(x_1, x_2))\), where \(f(x_1, x_2)\) is the ground truth function. For Figs. 4 and 5, the test dataset size is \(10^5\), and for Fig. 6 it is \(2\times 10^4\). All unspecified training hyperparameters follow PyTorch defaults. In particular, for AdamW, the hyperparameters are \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), (weight decay) \(\lambda = 0.01\), and \(\epsilon = 10^{-8}\).
Figs. 4 and 6 (a) concern compressing the training dataset. The task is function fitting in a teacher–student setup, described in Sec. 4.2. When the training dataset is weighted and denoted as \(\theta' = \{(c_j, w_j)\}_{j=1}^{d'}\), the data loader is implemented as follows: we draw i.i.d. samples from \(\{w_j\}_{j\in[d']}\), using \(\{c_j\}_{j\in[d']}\) as an unnormalized probability distribution, to form a mini-batch.
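A minimal PyTorch sketch of such a weighted data loader (the tensors and sizes below are placeholders, not the actual experimental data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# placeholder tensors standing in for the compressed training set theta' = {(c_j, w_j)}
inputs  = torch.rand(1000, 2) * 2 - 1             # (x1, x2) in [-1, 1]
targets = torch.randn(1000, 1)                    # noisy teacher outputs (placeholder)
c       = torch.rand(1000)                        # unnormalized compression weights

dataset = TensorDataset(inputs, targets)
sampler = WeightedRandomSampler(weights=c, num_samples=len(dataset), replacement=True)
loader  = DataLoader(dataset, batch_size=512, sampler=sampler)

for x, y in loader:                               # mini-batches drawn i.i.d. proportionally to c_j
    pass
```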
Figs. 5 and 6 (b) concern compressing the width of a two-layer neural network. The task is fitting an oscillating bivariate function known as a cylindrical harmonic: \[f(x_1, x_2) = J_6(20r) \cos(6\theta) ,\] where \(J_n\) is the Bessel function of the first kind, \(r=\sqrt{x_1^2+x_2^2}\) and \(\theta=\arctan(x_2/x_1)\), as also described in Sec. 5. The training dataset size for Fig. 5 is \(10^5\), and for Fig. 6 (b) is \(2\times 10^4\).
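For reference, the target function can be evaluated with SciPy (a minimal sketch; the function name is ours):

```python
import numpy as np
from scipy.special import jv

def cylindrical_harmonic(x1, x2):
    """Target function f(x1, x2) = J_6(20 r) cos(6 theta)."""
    r = np.sqrt(x1 ** 2 + x2 ** 2)
    # arctan2 differs from arctan(x2/x1) by at most pi, which cos(6*theta) cannot see
    theta = np.arctan2(x2, x1)
    return jv(6, 20.0 * r) * np.cos(6.0 * theta)

x1, x2 = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
values = cylindrical_harmonic(x1, x2)
```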
In Fig. 6, we show the test MSE loss scaling with respect to the training dataset size (a) and neural network width (b). For both (a) and (b), the update rule is AdamW. The learning rate is initially \(0.001\), and is modulated by a cosine annealing learning-rate scheduler, reaching \(0\) at the final epoch. Each data point is obtained by averaging \(10\) random instances of the training and test datasets and the neural network initialization, and the error bars indicate the standard deviation. For (a), we train each instance for \(2048\) epochs, each epoch containing one mini-batch of size \(512\), so that there is a constant compute budget for different sizes of the original and compressed datasets. For (b), we train each instance for \(2000\) epochs, with each epoch enumerating the training dataset. The batch size is \(128\).
Correspondence to: ziyinl@mit.edu