NeuroPrune: A Neuro-inspired Topological Sparse Training Algorithm for Large Language Models

Amit Dhurandhar\(^{*}\), Tejaswini Pedapati\(^{*}\), Ronny Luss\(^{*}\),
Soham Dan, Aurelie Lozano, Payel Das and Georgios Kollias

IBM Research, Yorktown Heights, NY


Abstract

Transformer-based Language Models have become ubiquitous in Natural Language Processing (NLP) due to their impressive performance on various tasks. However, expensive training as well as inference remains a significant impediment to their widespread applicability. While enforcing sparsity at various levels of the model architecture has found promise in addressing scaling and efficiency issues, there has been little attention to how sparsity affects network topology. Inspired by brain neuronal networks, we explore sparsity approaches through the lens of network topology. Specifically, we exploit mechanisms seen in biological networks, such as preferential attachment and redundant synapse pruning, and show that principled, model-agnostic sparsity approaches are performant and efficient across diverse NLP tasks, spanning both classification (such as natural language inference) and generation (summarization, machine translation), even though optimizing performance is not our sole objective. NeuroPrune is competitive with (or sometimes superior to) baselines on performance and can be up to \(10\)x faster in terms of training time for a given level of sparsity, simultaneously exhibiting measurable improvements in inference time in many cases.

1 Introduction↩︎

In the past decade, transformer-based models [1] leveraging attention mechanisms have led to state-of-the-art performance on NLP tasks and other multimodal applications, in both classification and generation settings. Despite these performance improvements, the computational overhead required for training and inference hinders progress. The models are large and are typically parameterized by many dense matrices, which raises the question of whether this complexity is necessary for better performance.

Sparsity in general neural networks has been considered using sparse regularizations on weights [2] and weight thresholding/masking [3]. Specifically for transformers, various attention masking patterns have been studied [4]. Another direction for inducing sparsity is to remove entire attention heads altogether [5]. Previous sparse methods, however, place little emphasis on the topology of the networks being trained [6].

In this paper, we study how certain network topologies can be exploited in transformer-based large language models (LLMs) to offer sparser models (in terms of fewer parameters and fewer attention heads overall) while maintaining performance. Our framework, which we call NeuroPrune, is model-agnostic as well as task-agnostic, and is a dynamic sparse training method inspired by biological neuronal networks present in the brain. For example, [7] discusses two stages by which connections (synapses) in a neuronal network evolve in the brain. First, an overabundance of synapses is created, which is similar to pretraining an LLM. In the second stage, synapses are judiciously removed until stability in the network is achieved, which is akin to fine-tuning an LLM for a particular task by inducing sparsity, or at a higher level, by removing attention heads. Our framework relates sparsity within the Multi-Layer Perceptron (MLP) layers and the attention heads, as well as sparsity at the level of attention heads, to two distinct processes that take place during that second stage of neuronal network development: preferential attachment (i.e. rich-get-richer) [8] and the elimination of redundancy [9].

Figure 1: Resulting sparsity patterns (\(\approx 50\%\), \(\approx 90\%\)) as determined by NeuroPrune in an intermediate Transformer layer of a BERT(-base) model learned on the SST2 GLUE benchmark dataset [10]. NeuroPrune sparsifies according to a preferential attachment topology as entire rows/columns of the attention and MLP matrices are zeroed out. Quantitatively, the standard deviation (sd) between the connectivity of neurons in the MLP layers increases up to two orders of magnitude (\(50\%\): \(12.12\), \(90\%\): \(4.16\)) compared with standard fine-tuning (\(0.13\)), as seen in Figure 2. This increase in sd is indicative of preferential attachment, similar to what is seen in biological neurons [8], where a minimal rich-get-richer mechanism is used to produce sparse and heavy-tailed networks. The pattern is qualitatively similar for other layers too, as can be seen in the Appendix.

Figure 2: Left is the MLP degree distribution for the SST2 dataset using a BERT model, indicative of preferential attachment for NeuroPrune as sparsity increases (echoing the degree distribution in brain functional networks [11]). Standard fine-tuning creates a dense network (black vertical line). Right, we see the non-uniformity in connectivity at different sparsity %s across GLUE tasks using NeuroPrune, indicative of this preferential attachment across tasks.

Preferential attachment inspired regularization Within MLP layers and attention heads, NeuroPrune is motivated by a well-known network concept called preferential attachment, which has been found to be highly relevant to neuronal networks in the brain [8], [11] over the last two decades. The general notion is that over time, neurons with more connections build even more connections, while those with fewer connections are removed. Similarly, our framework induces weighted \(l_1\) sparsity in MLP layers (weight inversely \(\propto\) connectivity/degree) and group sparsity within attention heads, so that influential neurons (measured by attention parameters) are maintained while those with little influence are pruned. Modeling the removal of weak synapses is an established approach [12] to understanding the refinement process of neurons in the brain. For LLMs, this behavior is illustrated in Figure 1, where NeuroPrune sparsifies the parameter matrices of transformers by zeroing out entire rows in attention and MLP layers. Quantitatively, this non-uniformity in connectivity is verified in Figure 2.

Figure 3: Left is the fraction of times a head is removed using NeuroPrune when fine-tuning on SST2 with a BERT model. The overall numbers (blue curve) are averaged across layers and runs (\(\pm\) sd), where at least \(10\) heads are removed. We also show individual layer numbers averaged across runs for the top three layers where most pruning of heads happens (Figure 13 in the appendix). We see there is a significant bias towards keeping the last head in each layer, leading to a more modular structure and also showcasing preferential attachment, as neurons from the previous and next layers will connect only to these heads. The middle head (head \(5\)) is also maintained more than most other heads, as it replaces many of the earlier heads. Results averaged across GLUE tasks on the right are similar.

Redundancy-based pruning While structured sparsity aims at preferential attachment, such pruning (by zeroing out weights) cannot determine which connections are redundant. Elimination of redundant connections is an important aspect of the refinement process [9], [13] that takes place after the brain develops very dense networks of connections. In the case of LLMs, we hypothesize that such redundancy can be measured by similarity between attention heads, whereupon similar attention heads can then be merged, resulting in reduced complexity while maintaining functionality (i.e. performance is maintained on downstream tasks). Such removal of redundancy is conjectured in [14] to be unique to the neuron development in the central nervous system of vertebrates. Figure 3 illustrates how often heads are found to be redundant using NeuroPrune. Generally, the last head is found to be the least redundant, while the middle head also exhibits limited redundancy. In Figure 13 in the appendix, we see head removal as a function of layers and find that the last three layers have the highest number of redundant heads.

Contributions In this paper, we propose a neuro-inspired topological sparse training algorithm with custom attention and MLP (structured) sparsity regularizations based on preferential attachment, and a novel redundancy-based head pruning scheme, which we map to the dominating set problem [15] in theoretical computer science.

Our approach has the following benefits: 1) It is task agnostic. 2) It is easily adaptable to different transformer-based LLM architectures as it does not add additional (mask) variables to do the pruning. We apply it to BERT (encoder), T5 (encoder-decoder) and OPT (decoder) models. 3) It learns sparsity patterns exhibiting principled topological structure. 4) It results in LLMs with competitive and sometimes even superior performance on different benchmarks and tasks (GLUE, summarization, machine translation), even though our proposal is motivated more by neuroscience than by solely trying to maximize performance. 5) It is generally much faster to train than the competing baselines, with time per epoch being similar to standard fine-tuning. It also exhibits inference speedups as the topological constraints encourage \(N\):\(M\)-type sparsity.

2 Notation↩︎

A Transformer consists of multiple identical units. Each unit in turn comprises a Multi-Head Attention (MHA) Layer and a Feed Forward (FFN) or MLP Layer (used interchangeably). Each attention layer is partitioned into multiple heads \(H_{i}\) composed of Query \(Q_{H_{i}}\), Key \(K_{H_{i}}\) and Value \(V_{H_{i}}\) parameter matrices. If \(d\) denotes the embedding dimension of each token in an input matrix \(X\) then, \[\begin{align} H_{i} = softmax\left(\frac{XQ_{H_{i}} K_{H_{i}}^{T}X^T}{\sqrt{d}}\right) XV_{H_{i}} \end{align}\] An MHA layer with \(k\) heads computes the attention of all heads in parallel and concatenates them: \(MHA = Concat(H_{1},\ldots, H_{k})W^{O}\), where \(W^{O}\) is an output dense matrix.

The FFN layer in turn has two linear layers, one to expand the dimensions \(L_{in}:\mathcal{R}^{d_e}\times \mathcal{R}^{d}\) and the other to project it back to the original dimension \(L_{out}:\mathcal{R}^{d}\times \mathcal{R}^{d_e}\). Typically \(d_e\gg d\) (e.g. in BERT \(d_e=3072\) and \(d=768\)), with \(L\) denoting the concatenation of all the MLP layers.

If \(Q\), \(K\), \(V\) are the Query, Key, and Value matrices of all the heads in a MHA layer concatenated together, then \(Q_{H_{i}}\), \(K_{H_{i}}\) and \(V_{H_{i}}\) correspond to the \((i-1)\frac{d}{k}+1\) to \(i\frac{d}{k}\) columns of each such matrix respectively. Let \(L_{in,H_i}\) denote the corresponding columns of the MLP and let \(A_{H_i}=[Q_{H_i}, K_{H_i}, V_{H_i}]\).

The use of the superscript in \(A^{(l)}\) denotes (attention) layer \(l\) of the transformer. We use \(\overrightarrow{\boldsymbol{x}}\) to signify that \(x\) is a vector.
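
To make this notation concrete, the following sketch (our own illustration with BERT-base dimensions and illustrative variable names, not the model code itself) computes each head \(H_i\) from its column slice of the concatenated \(Q\), \(K\), \(V\) matrices and assembles the MHA output:

```python
# Illustrative sketch of the notation above (not the exact model implementation).
import torch
import torch.nn.functional as F

d, k = 768, 12            # embedding dimension and number of heads (BERT-base)
d_h = d // k              # columns per head
X = torch.randn(16, d)    # 16 input tokens

Q, K, V = (torch.randn(d, d) for _ in range(3))  # concatenated per-head matrices
W_O = torch.randn(d, d)                          # output dense matrix

def head(i: int) -> torch.Tensor:
    """Attention output H_i using columns i*d_h ... (i+1)*d_h - 1 of Q, K, V (0-based)."""
    Qi, Ki, Vi = (M[:, i * d_h:(i + 1) * d_h] for M in (Q, K, V))
    scores = (X @ Qi) @ (X @ Ki).T / d ** 0.5    # X Q_i K_i^T X^T / sqrt(d)
    return F.softmax(scores, dim=-1) @ (X @ Vi)

MHA = torch.cat([head(i) for i in range(k)], dim=-1) @ W_O  # concatenate heads, project
```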

3 Related Work↩︎

Quantization [16], Knowledge Distillation to a smaller model [17], and Model Pruning are some ways to alleviate the extensive computational cost required by LLMs. Here we review prior work on model pruning, which is most relevant to us.

3.1 Unstructured Pruning↩︎

Unstructured pruning [18][22] removes the less salient parameters from the model, thereby achieving sparsity. Based on the lottery ticket hypothesis, [23] performs iterative magnitude pruning. [24], [25] apply the lottery ticket hypothesis to the BERT model. This class of pruning algorithms attains high sparsity while largely maintaining accuracy, but the methods are mostly post hoc. Moreover, the resulting pruned models do not provide much inference speedup.

3.2 Structured Pruning↩︎

A neural network can be divided into blocks or components: for instance, channels, kernels, and layers for a convolutional neural network, or attention heads and fully connected layers for a transformer. Structured Pruning [26], [27] involves removing an entire component, thus eliminating some of the multiply-and-accumulate computations, thereby accelerating inference.

3.2.0.1 Head Pruning

[5] were the first to examine if all the heads are necessary for a BERT model. They defined the importance of a head by the drop in performance of the model upon removing the head. [28] apply gates to each head and learn these gates using \(l_0\) regularization. [29] also learn gates and identify a subset of heads for each layer of the BERT model such that the drop in the model’s performance is minimal. They sample the top-k heads based on their importance score and use the Gumbel-softmax trick to make the top-k formulation differentiable. This method was shown to have superior results to competing methods on BERT, and we thus compare it with NeuroPrune’s head pruning strategy.

3.2.0.2 Block and Layer Pruning

[27] and [30] experimented with dropping different transformer layers, such as every alternate layer, or top-k layers, or middle layers, and found inference speedups. [31] divide the MHA and FFN layers into several blocks, and apply masks to each of the blocks to prune them.

CoFI [6] also prunes a transformer by applying a gate \(m_{h}\) to each of the heads, one mask to the entire MHA layer, and finally one to the MLP layer in the block. The model is then trained using \(l_0\) regularization to learn these gates. The sparsity constraints are imposed using Lagrangian multipliers. To further boost the performance of the pruned model, CoFI jointly prunes and performs layer-wise distillation.

As NeuroPrune also prunes the attention matrices, the feed-forward layer, and attention heads, this is the closest baseline when varying the percentage of sparsity. Note that in contrast to CoFI, we do not require any additional mask variables that necessitate modifying the architecture, and hence our approach is easily transferable across model architectures. We demonstrate this via experiments on BERT (an encoder-only model), T5 (an encoder-decoder model), and OPT (a decoder-only model).

4 Method↩︎

We propose (topological) sparsification of a Transformer block at three levels: i) The two (expand and contract) Multi-Layer Perceptron (MLP) layers, ii) the attention layers, and iii) head pruning at the level of attention layers. Our method, NeuroPrune, is detailed in Algorithm 4 with two sub-procedures given in Algorithms 5 and 6. The three sparsifications are described next.

Figure 4: NeuroPrune (Code provided)

Figure 5: ELIM_REDUNDANT

Figure 6: FIND_DOMINATING

4.1 MLP Sparsification↩︎

Preferential sparsification of the MLP layers is conceptually the simplest component of NeuroPrune. For each \(L_{in}\) and \(L_{out}\) matrix in each Transformer layer, a weighted \(l_1\) penalty is added to the training objective, where the weights for each row of entries in the matrix are inversely proportional to the (fractional) connectivity of that neuron. Specifically, let \(n_{in,i}\) be the number of entries in the \(i^{\text{th}}\) row of \(L_{in}\) with absolute values less than some small \(\epsilon>0\) (with \(n_{out,i}\) similarly defined for \(L_{out}\)). The MLP regularizer added to the training loss for layer \(l\) is as follows: \[\begin{align} \label{eq:mlp} R_{mlp}^{(l)}(L^{(l)}) = &\frac{1}{d}[n^{(l)}_{in,1},...,n^{(l)}_{in,d_e}]\cdot|L_{in}^{(l)}|\cdot\vec{1}_d\\&+\frac{1}{d_e}[n^{(l)}_{out,1},...,n^{(l)}_{out,d}]\cdot|L_{out}^{(l)}|\cdot\vec{1}_{d_e} \nonumber \end{align}\tag{1}\] where \(|.|\) denotes an element-wise absolute value and \(\vec{1}_d\) is a \(d\)-dimensional vector of \(1\)s. In essence, Equation 1 penalizes neurons with less connectivity more than the densely connected ones. This explicitly encourages preferential attachment, yielding a training process where sparsely connected neurons are likely to be weeded out.
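
A minimal sketch of this regularizer is given below (our own illustration of Equation 1; the function name is ours, and \(L_{in}\), \(L_{out}\) are assumed to be weight tensors of shapes \(d_e\times d\) and \(d\times d_e\)). Note that the near-zero counts are non-differentiable and hence act as constant row weights within a gradient step.

```python
import torch

def mlp_preferential_l1(L_in: torch.Tensor, L_out: torch.Tensor,
                        eps: float = 1e-4) -> torch.Tensor:
    """Weighted l1 penalty of Eq. (1): rows with more near-zero (|w| < eps)
    entries, i.e. less connected neurons, are penalized more heavily."""
    d_e, d = L_in.shape                               # L_in: (d_e, d), L_out: (d, d_e)
    n_in = (L_in.abs() < eps).sum(dim=1).float()      # near-zero count per row of L_in
    n_out = (L_out.abs() < eps).sum(dim=1).float()    # near-zero count per row of L_out
    r_in = torch.dot(n_in / d, L_in.abs().sum(dim=1))      # (1/d) sum_i n_in,i ||row_i||_1
    r_out = torch.dot(n_out / d_e, L_out.abs().sum(dim=1)) # (1/d_e) sum_i n_out,i ||row_i||_1
    return r_in + r_out
```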

4.2 Attention Sparsification↩︎

It is not obvious what topological sparsity based on preferential attachment would entail for attention. We conceive of a novel way of inducing such sparsity by leveraging the rich literature on group sparsity [32], [33].

Considering the connectivity of an input embedding neuron to the output neurons of an attention layer, it is evident that the \(i^{\text{th}}\) embedding dimension only interacts with the \(i^{\text{th}}\) row of the \(Q\), \(K\) and \(V\) matrices. These interactions can be visualized as connections to the output neurons. However, even one non-zero entry in the \(i^{\text{th}}\) row of \(Q\), \(K\), \(V\) leads to the \(i^{\text{th}}\) input neuron being connected to all output neurons. Hence, to remove the effect of this neuron on the output neurons, one needs to zero out the \(i^{\text{th}}\) row in all three matrices. In other words, a group sparsity penalty, where each group is a row of the concatenated \(A=[Q,K,V]\) matrix, is desired. Such a penalty encourages sparse rows to become more sparse as it tries to eliminate those rows by making them (almost) zero, again showcasing preferential behavior.

Rather than adding extra masking variables to implement preferential behavior, we leverage group sparsity and apply an \(l_p^q\) norm penalty on the rows of \([Q,K,V]\), where \(p=1\) and \(q=0.5\). The \(l_1^{.5}\) penalty was seen to be more robust than other choices in [33] as it leads to a sharp reduction in the parameter values belonging to a group. As such, we add the following regularization, corresponding to the attention matrix at layer \(l\), to the training loss, where \(A^{(l)}\) is the concatenated \([Q,K,V]\) matrix: \[\begin{align} \label{eq:attn} R_{attn}^{(l)}(A^{(l)}) = \sum_{i=1}^d \sqrt{\sum_{j=1}^{3d} |A_{ij}^{(l)}|} \end{align}\tag{2}\] Note that the above constraint is applied across heads in the attention layer as it considers the entire \(Q\), \(K\), \(V\) matrices (hence the inner summation over \(3d\) entries). Additionally, while the standard \(l_2\) group penalty induces weights within a group to be similar, this \(l_1^{.5}\) group penalty allows sparsity patterns to be learned within a group. For example, in Figures 1 and 10, while entire rows are often removed, we also observe that certain rows only exhibit sparsity in \(Q\) and \(K\) while leaving corresponding rows of \(V\) dense, which is still valuable as it indicates that attention may not be required for those neurons/dimensions.
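
Equation 2 can be sketched as follows (our own illustration; the small constant added before the square root is a numerical-stability assumption, not part of the formulation above):

```python
import torch

def attn_group_sparsity(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """l_1^{1/2} group penalty of Eq. (2); each group is one row of A = [Q, K, V]."""
    A = torch.cat([Q, K, V], dim=1)          # shape (d, 3d)
    row_l1 = A.abs().sum(dim=1)              # inner sum over the 3d entries of each row
    return torch.sqrt(row_l1 + 1e-12).sum()  # sum_i sqrt(row_i); eps only for stability
```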

4.3 Head Pruning↩︎

Unlike the attention and MLP sparsifications mentioned above, head pruning is done after each epoch as seen in Algorithm 4 (NeuroPrune). The main idea here is to remove heads in a layer that are similar to other heads and are hence deemed redundant. We want to remove as many heads as possible in order to get maximum sparsification. NeuroPrune accomplishes this by determining which heads are similar to many other heads, and then maintaining such heads while removing others. Note that similarity is not transitive, and thus removal of heads is not trivial. Algorithm 5 details these steps.

NeuroPrune removes heads that are dominated by other heads, i.e., the dominated head is similar to only a subset of the heads that the dominating head is similar to. The problem of keeping a minimum number of heads based on similarity can be mapped to the dominating set problem [15], where each head is a vertex and each edge indicates similarity. We want to find the minimum number of vertices such that they, along with their adjacent vertices, account for all the vertices in the graph. This problem is NP-Hard, and our approach (detailed in Algorithm 6) is a quadratic-time approximation that biases towards keeping later heads in a layer. Since our algorithm also biases towards keeping vertices (heads) with high degrees, our head pruning scheme also elicits preferential behavior.
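
A possible rendering of such a quadratic-time greedy approximation is sketched below. This is our own illustration rather than the exact Algorithm 6; in particular, measuring head similarity by the cosine similarity of flattened head parameters against a threshold \(\theta\) is an assumption.

```python
import torch
import torch.nn.functional as F

def find_dominating_heads(heads: list[torch.Tensor], theta: float) -> set[int]:
    """Greedy dominating-set style selection of heads to keep.

    `heads` holds one parameter tensor per head (e.g. A_{H_i}); two heads are
    treated as similar when the cosine similarity of their flattened parameters
    exceeds 1 - theta (an assumed similarity criterion). Visiting high-degree
    heads first, with ties broken toward later heads, biases the kept set as
    described in the text."""
    k = len(heads)
    flat = [h.flatten() for h in heads]
    adj = [set() for _ in range(k)]
    for i in range(k):                                # O(k^2) pairwise similarities
        for j in range(i + 1, k):
            if F.cosine_similarity(flat[i], flat[j], dim=0) > 1 - theta:
                adj[i].add(j)
                adj[j].add(i)

    keep, covered = set(), set()
    for i in sorted(range(k), key=lambda h: (len(adj[h]), h), reverse=True):
        if i not in covered:                          # head i is not yet dominated
            keep.add(i)
            covered |= adj[i] | {i}
    return keep                                       # heads outside `keep` are pruned
```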

Importantly, unlike previous methods [5], [29], NeuroPrune does not prune according to head importance, but rather head redundancy, and hence even important heads can get pruned. The experiments indeed show that the average importance of the heads we eliminate is quite high. This can lead to more aggressive pruning and faster training times, as witnessed in our experiments.

4.4 NeuroPrune↩︎

Algorithm 4 puts together the above regularizations and head pruning. Our fine-tuning procedure is very similar to the vanilla Stochastic Gradient Descent (SGD) methods [34] that are typically used for training LLMs. In Algorithm 4, SGD(\(\cdot\)) refers to running a single epoch of any SGD algorithm over a batched dataset. The key additions made in NeuroPrune are the MLP and attention regularizations, which appear in the objective being passed to SGD(\(\cdot\)). Head pruning is done after each epoch of stochastic gradient descent in the inner for loop of Algorithm 4.
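
Schematically, the overall loop could look as follows (our own sketch of the procedure described above, reusing the regularizer and head-selection sketches from earlier sections; model.transformer_layers, optimizer, and prune_heads are placeholder names rather than a specific API):

```python
def neuroprune_finetune(model, data_loader, task_loss, optimizer,
                        alpha, beta, theta, epochs):
    """Schematic NeuroPrune loop: regularized SGD epochs plus per-epoch head pruning."""
    for _ in range(epochs):
        for batch in data_loader:                     # one SGD epoch
            loss = task_loss(model, batch)
            for layer in model.transformer_layers:    # add the topological regularizers
                loss = loss + alpha * attn_group_sparsity(layer.Q, layer.K, layer.V)
                loss = loss + beta * mlp_preferential_l1(layer.L_in, layer.L_out)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        for layer in model.transformer_layers:        # redundancy-based head pruning
            keep = find_dominating_heads(layer.heads, theta)
            prune_heads(layer, keep)                  # explicit removal, not masking
    return model
```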

Figure 7: Performance (\(1^{\text{st}}\) column), inference time (\(2^{\text{nd}}\) column) and train time (\(3^{\text{rd}}\) column) for NeuroPrune and CoFI/\(l_1\) on GLUE tasks at different sparsity percentages. The \(1^{\text{st}}\), \(2^{\text{nd}}\) and \(3^{\text{rd}}\) rows correspond to BERT-base, T5-base and OPT-125m models respectively. In the \(1^{\text{st}}\) row we see that NeuroPrune outperforms CoFI on the smaller GLUE datasets and is competitive on larger ones, with consistently better inference and train times. In the next two rows, we see that NeuroPrune is largely better than \(l_1\) sparsity, especially at intermediate sparsities (25-80\(\%\)), with notable inference time gains and comparable train time.

Table 1: NeuroPrune (NP) vs \(l_1\) pruning on the CNNDaily summarization dataset using T5-base. FT stands for standard fine-tuning. As can be seen, we are most of the time better on Rouge metrics as well as inference time. The train times are similar. Best values for each sparsity % (\(s\)) are bolded.
\(s\) | Meth. | \(\uparrow\)Rouge1 | \(\uparrow\)Rouge2 | \(\uparrow\)RougeL | \(\uparrow\)RougeLsum | \(\downarrow\)Inf. Time(s) | \(\downarrow\)Train Time(s)
\(0\) | FT | \(43.18\) | \(20.47\) | \(30.77\) | \(40.41\) | \(0.455\) | \(24603\)
\(25\) | NP | \(\boldsymbol{43.07}\) | \(\boldsymbol{20.34}\) | \(\boldsymbol{30.7}\) | \(\boldsymbol{40.31}\) | \(\boldsymbol{0.451}\) | \(\boldsymbol{24620}\)
\(25\) | \(l_1\) | \(42.19\) | \(20.12\) | \(29.29\) | \(39.33\) | \(0.454\) | \(24621\)
\(50\) | NP | \(41.96\) | \(\boldsymbol{19.52}\) | \(\boldsymbol{29.73}\) | \(\boldsymbol{39.2}\) | \(\boldsymbol{0.442}\) | \(24605\)
\(50\) | \(l_1\) | \(\boldsymbol{42.18}\) | \(19.02\) | \(29.29\) | \(38.65\) | \(0.451\) | \(\boldsymbol{24601}\)
\(70\) | NP | \(\boldsymbol{41.6}\) | \(\boldsymbol{18.45}\) | \(\boldsymbol{28.56}\) | \(\boldsymbol{37.93}\) | \(\boldsymbol{0.431}\) | \(\boldsymbol{24623}\)
\(70\) | \(l_1\) | \(40.1\) | \(18.02\) | \(27.51\) | \(36.63\) | \(0.441\) | \(24628\)
\(80\) | NP | \(\boldsymbol{36.92}\) | \(\boldsymbol{16.78}\) | \(\boldsymbol{26.29}\) | \(\boldsymbol{34.35}\) | \(\boldsymbol{0.427}\) | \(24614\)
\(80\) | \(l_1\) | \(34.27\) | \(14.95\) | \(25.11\) | \(32.79\) | \(0.437\) | \(\boldsymbol{24610}\)
\(90\) | NP | \(\boldsymbol{33.92}\) | \(\boldsymbol{13.78}\) | \(\boldsymbol{24.29}\) | \(\boldsymbol{31.35}\) | \(\boldsymbol{0.415}\) | \(\boldsymbol{24602}\)
\(90\) | \(l_1\) | \(31.88\) | \(11.94\) | \(23.18\) | \(29.22\) | \(0.422\) | \(24608\)
\(95\) | NP | \(\boldsymbol{32.17}\) | \(\boldsymbol{13.72}\) | \(\boldsymbol{23.66}\) | \(\boldsymbol{30.97}\) | \(\boldsymbol{0.406}\) | \(\boldsymbol{24610}\)
\(95\) | \(l_1\) | \(30.25\) | \(11.21\) | \(21.42\) | \(28.16\) | \(0.417\) | \(24611\)

Figure 8: Performance (left), inference time (center) and train time (right) for NeuroPrune and DSP on GLUE tasks, where different numbers of heads are present in a BERT-base model. NeuroPrune is better than or similar (rarely worse) in performance to DSP on most datasets, and is notably more efficient to train. Inference time is (slightly) improved when many heads are removed; however, the DSP code (simply) masks heads rather than explicitly pruning them like ours does, and hence if these masked heads were removed, the inference time of DSP might also improve, as shown in their paper.

5 Experiments↩︎

We now test our method in two different settings: i) varying sparsity and ii) varying number of heads. In each setting, we run our method on the GLUE [10] tasks, where the dev set is used for testing1. For i) we also test our method on the CNN/Daily Mail [35] summarization task. For ii) we also run our method on a machine translation task for German to English on the IWSLT [36] dataset.

Baselines and Models: When varying sparsity we compare against CoFI [6], which is a state-of-the-art (SOTA) method for inducing structured sparsity in LLMs. Since the author-shared code is for BERT (encoder) we compare with CoFI on BERT(-base) for GLUE tasks. We additionally implemented NeuroPrune for T5 (encoder-decoder) [37] and OPT (decoder) [38] models. Since a CoFI implementation was not available for T5 and OPT, and would require modifying the architecture, we apply \(l_1\) sparsity to the attention and MLP blocks of the transformer as a baseline. For summarization, we use a T5(-base) model and again \(l_1\) as a baseline.

When varying the number of heads, CoFI does not provide an easy way to control for this number, and hence we compare against a specialized head pruning method called Differential Subset Pruning (DSP) [29], another SOTA head pruning method. Here too, code is available for BERT, but not for T5 or OPT, and hence we compare NeuroPrune with DSP on BERT(-base) for the GLUE tasks. We do not show head removal results for the other models as there are no natural baselines like we had for sparsity (\(l_1\)). For machine translation, we use an 18-layer encoder-decoder model with 6 heads per layer as done in [29].

Metrics: For performance we report accuracy (Acc) for the GLUE datasets, except COLA where Matthews Correlation (MCorr) is the standard metric, Rouge for summarization, and BLEU scores for machine translation. We also report (average) inference and train times.

Experimental details such as the hyper-parameter values explored for each method, the number of epochs they are run, and the hardware used are in the appendix.

5.1 Varying sparsity percentage↩︎

We report results for sparsity percentages of \(25\), \(50\), \(70\), \(80\), \(90\) and \(95\) for GLUE and summarization.

GLUE: In Figure 7 we see the behavior of our method on GLUE datasets for BERT, T5, and OPT models. Our method is always competitive with CoFI, outperforming it on the smaller datasets. This is possible because we do not add extra variables to the model, which, when coupled with the topological constraints, results in more stable performance. The structured sparsity also gives improved inference times, and the train time is much lower, since we need to run our method for only a few epochs and the time per epoch is the same as standard fine-tuning.

For T5 and OPT we compare with \(l_1\) sparsification. As can be seen, NeuroPrune is better than \(l_1\) in most cases w.r.t. performance, especially at intermediate sparsities. We believe this happens because NeuroPrune can choose (via the \(\alpha\) and \(\beta\) parameters) whether to sparsify attention or MLP more when optimizing performance, even though the constraints are structured. We also see benefits in inference time, again possibly because of the structured sparsification. Train times are similar as we run \(l_1\) for the same number of epochs as NeuroPrune and its per-epoch time is similar to that of standard fine-tuning.

Summarization: For summarization we see similar qualitative behavior of NeuroPrune vs \(l_1\) sparsity for the T5 model. NeuroPrune is better on the Rouge metrics in most cases and its inference time is also slightly better. The train times are again similar.

5.2 Varying number of heads↩︎

Beyond sparsity, we now observe the behavior of NeuroPrune w.r.t. the number of heads pruned. We roughly keep \(10\), \(25\), \(50\), \(75\), \(100\), \(125\) and \(144\) (i.e. all) heads. DSP is a strong competitor here.

GLUE: As can be seen, we generally perform better than or similar to DSP, and rarely worse. We believe this is because we have an additional knob of parameter sparsification, which can make heads similar as sparsity increases; given our novel redundancy-based head pruning algorithm, we can then effectively replace these heads with others, keeping the overall performance of the model largely intact. The training time for our method is also significantly better since we can achieve the necessary head pruning in a few epochs, in addition to the fact that each epoch takes a similar time to standard fine-tuning. The inference time is better, but that improvement may be reduced if the DSP code explicitly removed heads rather than just masking them2.

Machine translation: We see in Figure 9 (left) that NeuroPrune outperforms DSP on a machine translation task, and with higher levels of sparsity. Figure 9 (right) illustrates efficient frontiers for both NeuroPrune and DSP; each curve offers a variety of models (varying sparsity and number of heads present) that achieve roughly similar performance. NeuroPrune clearly dominates DSP in terms of sparsity. DSP was run for varying numbers of heads present, whereas NeuroPrune was run with varying sparsity parameters, which generated more models. Unlike with GLUE tasks, we implemented NeuroPrune regularizations and head pruning within the DSP codebase, which uses the fairseq toolkit [39], and used head masking for pruning. Since attention heads are still attached to the model, inference and train times between the methods on this task were similar; however, note that the sparsity led to faster convergence, i.e., significantly better performance, for NeuroPrune when the number of heads present was greater than 70. NeuroPrune also has the advantage that it can be adapted to any architecture, whereas DSP must modify an architecture to include gates. Note that [29] show results from [5] (which significantly underperforms) and [28] (which has similar performance but again requires direct architecture modification).

Figure 9: Performance (left) and Efficient Frontier (right) for NeuroPrune and DSP on a German to English translation task. NeuroPrune consistently outperforms DSP across models with varying numbers of attention heads present, and typically with much higher levels of sparsity. The efficient frontier shows that for similar levels of performance, NeuroPrune finds much sparser models for a fixed number of heads present.

5.3 Other Insights↩︎

Topological sparsity: As seen in Figure 1 (and Figures 10, 11, 12 in the appendix), our constraints try to eliminate neurons both in the MLP layers as well as the attention layers. This is observed as a row/column (sparsity) pattern in these matrices. Interestingly, sometimes only the \(Q\) and \(K\) entries in a row are sparsified, although the group sparsity constraint is applied to \(QKV\) jointly. The similarity in sparsification patterns for matrices \(Q\) and \(K\) directly reflects the symmetry of the operations they are jointly involved in when computing the attention coefficients as inner products of each row of \(Q\) with each row of \(K\), compactly as \(Q K^\top\), unlike \(V\), which is simply a linear projection of the token embeddings. The MLP portions, as seen in Figure 1, exhibit preferential attachment with increasing sparsity, which is consistent with brain functional networks.

Importance of removed heads: In Figure 14 in the appendix, we see the relative importance of removed heads, defined as the ratio of the importance of the removed heads in a layer to that of the heads that remain, averaged across all the layers. The individual head importances are computed as the sum of the absolute output dense weights that emerge from a head. We see that NeuroPrune can prune more important heads than DSP as it finds redundant heads, thereby sparsifying more aggressively, which is evidenced by the faster training time on GLUE tasks for similar levels of head removal.
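
Under the notation of Section 2, this importance measure can be sketched as follows (our own illustration; the exact aggregation used for the figure may differ):

```python
import torch

def head_importance(W_O: torch.Tensor, k: int) -> torch.Tensor:
    """Importance of each head as the sum of absolute output dense weights
    attached to it: head i feeds rows i*d_h ... (i+1)*d_h - 1 of W_O."""
    d = W_O.shape[0]
    d_h = d // k
    return torch.stack([W_O[i * d_h:(i + 1) * d_h].abs().sum() for i in range(k)])
```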

Head removal: As seen in Figure 3, there is a high bias to keep the last head in each layer. This is because, when a head is similar to a set of heads that another head is also similar to, NeuroPrune prefers keeping the later head, thus encouraging more modularity in a layer. In Figure 13 in the appendix, we see that later layers, and often the first layer, lose more heads than the intermediate ones, which is consistent with [29].

6 Discussion↩︎

This work shows that NeuroPrune, inspired by sparsity, modularity, and preferential attachment topology in brain functional networks, is competitive with other SOTA dynamic sparse training methods, even though it does not solely try to optimize performance. It is also more efficient than these methods in training time, with speed-ups also seen in inference.

There are multiple avenues for future research. First, it would be interesting to combine our redundancy-based and head-importance pruning methods to produce even more aggressive and efficient pruning of heads. Second, the structured strategies we used for fine-tuning could also be tried during pre-training. Third, one could test the generalizability of models trained using NeuroPrune on related, but different tasks and measure if similar gains can be secured. We believe our work will spur progress on efficient architectures.

Limitations↩︎

All the datasets we considered were for the English language. Results may vary for other languages. We applied our method to three LLMs, but more architectures, not to mention more tasks, could be tested in the future. Although our method is easy to adapt to different architectures while also being efficient, it does not allow the user to specify the exact number of heads one wants to prune, and it is limited to Transformer-based LLM architectures. The number of heads pruned is implicitly a function of the threshold \(\theta\), the similarity of the pre-trained/fine-tuned heads, and the sparsity. We also have three hyper-parameters (\(\alpha\), \(\beta\) and \(\theta\)) that need to be specified for each run.

Ethics Statement↩︎

Our work could be used to dynamically sparsify other LLMs and models while fine-tuning or pre-training. The sparsification may result in reduced alignment if the LLM is aligned to certain values, especially if those values are not encapsulated by the loss function used to fine-tune with our method. So although one may use the method to create smaller models, one has to be cognizant of what aspects may have been lost in the process. The method is easy to adapt to transformer-based models and hence could possibly be widely used, but improvements such as being able to specify the number of heads might be beneficial in future versions.

7 Experimental Details↩︎

Hardware: All experiments were conducted on an NVIDIA A100 GPU with 40 GB memory.

Setup: NeuroPrune results were obtained by varying \(\alpha\) and \(\beta\) from \(10^{-7}\) to \(0.1\) in multiples of \(10\). The \(l_1\) penalty parameter also took these values. \(\theta\) took values in \(\{0.15, 0.2, 0.25\}\) for the head removal experiments, where the default was set to \(0.15\) for the GLUE and summarization tasks. It was sufficient to run NeuroPrune for a single epoch for the larger GLUE datasets (viz. MNLI, QQP, QNLI and SST2) and the summarization task, while for the other smaller GLUE datasets we ran it for \(3\) to \(5\) epochs. \(\epsilon\) for MLP sparsification was set to \(10^{-4}\). CoFI fine-tunes the model before it starts pruning. For the smaller GLUE datasets, CoFI fine-tunes for \(4\) epochs before pruning with \(100\) total epochs, while for the larger datasets these numbers are \(2\) and \(20\) respectively. All the other parameters were unchanged. To make the comparison fair, we turned off the distillation option. DSP runs 3 epochs of joint (fine-tuning and mask learning) training. For machine translation, NeuroPrune results were obtained with \(\alpha\in\{0.005, 0.01, 0.05, 0.1, 0.25, 0.5\}\), \(\beta\in\{0.05, 0.1\}\) and \(\theta\) varying from 0.1 to 0.4 in increments of 0.02. \(\epsilon\) was set as in GLUE. Since there was no pretrained model, 15 epochs of pretraining were done followed by 15 epochs of NeuroPrune fine-tuning, for a total of 30 epochs. DSP joint training was also done with a total of 30 epochs.
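
For reference, the hyper-parameter grids described above can be summarized as follows (a restatement of the values in this section; the dictionary layout is purely illustrative):

```python
import numpy as np

# GLUE and summarization sweeps
glue_grid = {
    "alpha": [10.0 ** p for p in range(-7, 0)],   # 1e-7, 1e-6, ..., 1e-1
    "beta":  [10.0 ** p for p in range(-7, 0)],
    "theta": [0.15, 0.2, 0.25],                   # default 0.15 for head removal
    "eps":   1e-4,                                # MLP sparsification threshold
}

# Machine translation (IWSLT) sweep
mt_grid = {
    "alpha": [0.005, 0.01, 0.05, 0.1, 0.25, 0.5],
    "beta":  [0.05, 0.1],
    "theta": np.round(np.arange(0.1, 0.41, 0.02), 2).tolist(),  # 0.1 to 0.4 in steps of 0.02
    "eps":   1e-4,
}
```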

Figure 10: Attention layers for the BERT model, where the top three rows correspond to standard fine-tuning, and the next sets of three correspond to \(\approx 25\%\), \(\approx 50\%\) and \(\approx 90\%\) sparsity using NeuroPrune. As can be seen, NeuroPrune encourages a preferential attachment topology.

Figure 11: MLP_in layers for the BERT model, where the top three rows correspond to standard fine-tuning, and the next sets of three correspond to \(\approx 25\%\), \(\approx 50\%\) and \(\approx 90\%\) sparsity using NeuroPrune. As can be seen, NeuroPrune encourages a preferential attachment topology.

Figure 12: MLP_out layers for the BERT model, where the top three rows correspond to standard fine-tuning, and the next sets of three correspond to \(\approx 25\%\), \(\approx 50\%\) and \(\approx 90\%\) sparsity using NeuroPrune. As can be seen, NeuroPrune encourages a preferential attachment topology.

Figure 13: We see the (average) # of heads removed per layer using NeuroPrune when fine tuning on the GLUE Benchmark datasets with a BERT(-base) model for cases where at least 10 heads are removed. As can be seen more heads are pruned from the later layers and the first one as compared to the middle layers.

Figure 14: Relative importance of removed heads for NeuroPrune and DSP averaged across runs, where at least 10 heads are removed. We see that NeuroPrune removes more important heads than DSP does because of its redundancy based head elimination algorithm.

8 Additional Figures↩︎

In Figures 10, 11 and 12 we show the sparsity patterns induced by NeuroPrune in a BERT-base model at different sparsity percentages on the SST2 dataset.

In Figure 13, we see the number of heads removed on average per layer over the GLUE benchmark. As can be seen, the last three layers and the first layer undergo most pruning.

In Figure 14, we see the relative head importance of removed to kept heads for NeuroPrune and DSP. As can be seen our redundancy based pruning removes more important heads on average.

References↩︎

[1]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.
[2]
Ziming Liu, Eric Gan, and Max Tegmark. 2024. Seeing is believing: Brain-inspired modular training for mechanistic interpretability. Entropy, 26(1).
[3]
J. Liu, Z. Xu, R. Shi, R. C. C. Cheung, and H. K. H. So. 2020. Dynamic sparse training: Find efficient sparse network from scratch with trainable masked layers. In ICLR.
[4]
H. Shi, J. Gao, X. Ren, H. Xu, X. Liang, Z. Li, and J. T. Kwok. 2020. Sparsebert: Rethinking the importance analysis in self-attention. In ICML.
[5]
Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems, volume 32.
[6]
Mengzhou Xia, Zexuan Zhong, and Danqi Chen. 2022. Structured pruning learns compact and accurate models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 1513–1528. Association for Computational Linguistics.
[7]
Ye Yuan, Jian Liu, Peng Zhao, Fu Xing, Hong Huo, and Tao Fang. 2019. Structural insights into the dynamic evolution of neuronal networks as synaptic density decreases. Frontiers in Neuroscience, 13:892.
[8]
Christopher W Lynn, Caroline M Holmes, and Stephanie E Palmer. 2024. Heavy-tailed neuronal connectivity arises from hebbian self-organization. Nature Physics, pages 1–8.
[9]
Hao Wang, Hong Liu, and Zhong wei Zhang. 2011. Elimination of redundant synaptic inputs in the absence of synaptic strengthening. Journal of Neuroscience, 31(46):16675–16684.
[10]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 24th International Conference on Learning Representations.
[11]
Petra E Vértes, Aaron F Alexander-Bloch, Nitin Gogtay, Jay N Giedd, Judith L Rapoport, and Edward T Bullmore. 2012. Simple models of human brain functional networks. Proceedings of the National Academy of Sciences, 109(15):5868–5873.
[12]
Gal Chechik, Isaac Meilijson, and Eytan Ruppin. 1999. Neuronal regulation: A biologically plausible mechanism for efficient synaptic pruning in development. Neurocomputing, 26-27:633–639.
[13]
Kouichi Hashimoto and Masanobu Kano. 2013. Synapse elimination in the developing cerebellum. Cellular and Molecular Life Sciences, 70:4667–4680.
[14]
Jeff W. Lichtman and Howard Colman. 2000. Synapse elimination and indelible memory. Neuron, 25(2):269–278.
[15]
Robert B. Allan and Renu Laskar. 1978. On domination and independent domination numbers of a graph. Discrete Mathematics, 23(2):73–76.
[16]
Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Stephen Gou, Phil Blunsom, Ahmet Üstün, and Sara Hooker. 2023. Intriguing properties of quantization at scale. arXiv preprint arXiv:2305.19268.
[17]
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2023. Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543.
[18]
Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 10323–10337. PMLR.
[19]
Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2023. A simple and effective pruning approach for large language models. CoRR, abs/2306.11695.
[20]
Song Han, Huizi Mao, and William J. Dally. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR.
[21]
Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, and Surya Ganguli. 2020. Pruning neural networks without any data by iteratively conserving synaptic flow. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
[22]
Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. 2019. SNIP: Single-shot network pruning based on connection sensitivity. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
[23]
Jonathan Frankle and Michael Carbin. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
[24]
Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT plays the lottery, all tickets are winning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 3208–3229. Association for Computational Linguistics.
[25]
Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Zhangyang Wang, and Jingjing Liu. 2021. EarlyBERT: Efficient BERT training via early-bird lottery tickets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 2195–2207. Association for Computational Linguistics.
[26]
Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. 2017. Structured pruning of deep convolutional neural networks. ACM J. Emerg. Technol. Comput. Syst., 13(3):32:1–32:18.
[27]
Angela Fan, Edouard Grave, and Armand Joulin. 2020. Reducing transformer depth on demand with structured dropout. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
[28]
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 5797–5808. Association for Computational Linguistics.
[29]
Jiaoda Li, Ryan Cotterell, and Mrinmaya Sachan. 2021. Differentiable subset pruning of transformer heads. Trans. Assoc. Comput. Linguistics, 9:1442–1459.
[30]
Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. 2020. Poor man’s BERT: Smaller and faster transformer models. CoRR, abs/2004.03844.
[31]
François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M. Rush. 2021. Block pruning for faster transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 10619–10629. Association for Computational Linguistics.
[32]
Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. 2012. Structured sparsity through convex optimization. Statistical Science, 27(4):450–468.
[33]
Yaohua Hu, Chong Li, Kaiwen Meng, Jing Qin, and Xiaoqi Yang. 2017. Group sparse optimization via lp,q regularization. Journal of Machine Learning Research, 18(30):1–52.
[34]
Léon Bottou, Frank E. Curtis, and Jorge Nocedal. 2018. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311.
[35]
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.
[36]
Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT evaluation campaign. In Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign, pages 2–17.
[37]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).
[38]
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models. arXiv:2205.01068.
[39]
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.

  1. For MNLI we report the matched dev set accuracies.↩︎

  2. We use HuggingFace’s prune_heads() function for this.↩︎