Prompt-prompted Mixture of Experts
for Efficient LLM Generation
Harry Dong^{1}
CMU
Beidi Chen
CMU
Yuejie Chi
CMU
April 01, 2024
Transformer-based large language models (LLMs) have been applied to many fields due to their remarkable utility, but this comes at a considerable computational cost at deployment. Fortunately, some methods such as pruning or constructing a mixture of experts (MoE) aim at exploiting sparsity in transformer feedforward (FF) blocks to gain boosts in speed and reduction in memory requirements. However, these techniques can be very costly and inflexible in practice, as they often require training or are restricted to specific types of architectures. To address this, we introduce GRIFFIN, a novel training-free MoE that selects unique FF experts at the sequence level for efficient generation across a plethora of LLMs with different non-ReLU activation functions. This is possible due to a critical observation that many trained LLMs naturally produce highly structured FF activation patterns within a sequence, which we call flocking. Despite our method’s simplicity, we show with 50% of the FF parameters, GRIFFIN maintains the original model’s performance with little to no degradation on a variety of classification and generation tasks, all while improving latency (e.g. 1.25\(\times\) speed-up in Llama 2 13B on an NVIDIA L40). Code will be available at https://github.com/hdong920/GRIFFIN.
Transformers [1] have demonstrated incredible capabilities across a plethora of domains [2]–[4]. Their large language model (LLM) successors [5]–[10] have pushed the bar higher, but these behemoths have become performant at the price of enormous amounts of computation and memory demands. One significant contributor is the model size itself. Not only is storage an issue, but model layers also tend to be wide and numerous, slowing down inference. Moreover, given the existence of sparse structures in LLMs, especially in feedforward (FF) blocks [11]–[14], these models waste computation on intermediate features with little to no impact on the final result. For instance, it has been observed that in OPT-175B [15], fewer than 5% of neurons in FF blocks have nonzero values per token [14], meaning roughly 95% of the compute in each FF block is wasted. Usually comprising around two-thirds of the parameters in an LLM, FF blocks can be serious memory and compute bottlenecks. These inefficiencies are highly problematic in latency-sensitive scenarios like in chatbots and autonomous vehicles.
There have been many methods to exploit sparsity in LLMs for efficiency gains, such as pruning model weights and constructing mixtures of experts (MoEs). Pruning removes low-impact pre-trained weights to reduce storage, yet this often does not translate to real speed-ups in practice, unless the pruning is done in a hardware-friendly manner which typically causes greater performance deterioration. MoEs better preserve the original performance by adaptively selecting subsets of the model to use per input but also come with drawbacks. Unless the model has been trained in this fashion [8], [16], it will need to learn a cheap yet effective gating function (expert selection mechanism) and sometimes even require full fine tuning. Perhaps an even bigger weakness of many of these methods is the inability to effectively carry over to pre-trained LLMs with non-ReLU activations. We seek to overcome these challenges with MoEs:
Most current LLMs are not MoEs, so they must be adapted by training gating functions and/or fine tuning the whole model.
Gating functions should be simple enough to not cost more than they save.
Many current approaches work solely on FF blocks with ReLUs or depend on expensive conversions to ReLUs.
Daunting at first, these challenges become surmountable because of a simple observation of a phenomenon we call flocking: highly consistent sparse activations that persist throughout a sequence, observed in many LLMs. Flocking emerges in FF activations (inputs into the FF down projection) when we focus on a sequence’s relative activation magnitudes instead of the absolute values. Examples from Llama 2 7B and Gemma 7B are shown in Figure 1. The key takeaway is that neurons which produce high relative magnitudes are naturally shared across tokens within a sequence, as seen with the distinct vertical streaks. Even more strikingly, the two models have different architectures and different non-ReLU activation functions.
Unlike existing pruning or MoE methods, we exploit flocking in our design of GRIFFIN (Gating by Repetition In Feedforward Intermediate Neurons), a highly performant and efficient training-free method to turn an LLM into an MoE. GRIFFIN does this by using a sequence’s prompt to determine the experts to activate during generation, allowing it to overcome all of the aforementioned challenges:
No Preparation: Our no-cost method is completely training-free and requires no preparation. Moreover, the simple implementation of GRIFFIN means it can easily be dropped into any FF block.
Simple Expert Selection: Since flocking exists throughout a sequence, the prompt reveals the most relevant FF neurons for generation with little to no performance loss. The selection process is parameter-free and adds negligible overhead.
Model & Activation Function Diversity: Thorough experimentation demonstrates the efficacy of GRIFFIN on numerous models, including Llama 2 [5], Gemma [9], Mistral [7], OPT [15], and ReluLlama [19]. Together, the tested activation functions include ReLU, SwiGLU, GEGLU, and ReGLU [20].
In this paper, we show GRIFFIN is a simple and strong method to turn any LLM into an MoE because of flocking. In the next section (Section 2), we discuss some strengths and weaknesses of current methods that seek to improve FF efficiency. In more detail, we formalize the MoE problem and its motivation in Section 3 and present our novel approach in Section 4.2, which requires a thorough examination of the surprising phenomenon of flocking shared by many LLMs in Section 4.1. Our rigorous experiments demonstrate GRIFFIN preserves performance on classification and generation even after removing 50% of FF neurons (Section 5.1), all while having lower latency (Section 5.2). For instance, GRIFFIN reduces the number of active parameters in Llama 2 13B from 13B to 8.8B during generation to improve latency by 1.25\(\times\) with almost no loss in performance. Finally, we show our method’s incredible scalability and robustness in several ablation studies (Section 5.3).
Our novel method and FF activation observations are inspired and motivated by ample amounts of previous research that sought to characterize FF sparsity and accelerate LLMs.
The observation that transformer FF blocks produce sparse activations is not new [11]–[14], [21]. In ReLU-based LLMs like OPT [15], the activations can be exceptionally sparse, and this sparsity becomes more apparent for larger models [14]. As more models use non-sparse activation functions like GLU variants [20], it is difficult for neurons to have no contribution to the output since these functions do not have an interval that maps to zero. Without exact sparsity, the efficacy of methods that exploit activation sparsity becomes limited. As such, this has ushered in a wave of models that are either adapted from available models [8], [22]–[25] or trained from scratch [16] which can produce activations that are exactly zero with little to no performance loss. Even so, these methods require considerable amounts of computational resources.
Pruning [26] is another sparsity-guided way to tackle compute and memory bottlenecks of models. Previously, the common method would be some variation of iteratively rounding weights down to zero based on some score and retraining to recover any lost performance [27]–[30]. While this can result in most parameters being pruned, this method comes with a few issues. First, with the increasing scale of LLMs, retraining becomes impractical for most. Fortunately, cheap methods to effectively prune LLMs have been developed [31]–[34]. The second issue is that unless pruning is done in a structured manner [35]–[39], it is difficult to see real computational savings, yet structured pruning often leads to much more severe performance degradation. Third, pruning enforces sparsity to be static, which can be a strong assumption since FF blocks are widely believed to contain the model’s memory [11].
Making sparsity more dynamic has motivated the design of mixture-of-experts (MoEs) [40] to avoid computing low-impact features in trained models at varying granularities [14], [22], [25], [41]–[44]. The main idea of MoEs is to use a gating function to identify a small subset of neurons that will be used to compute the layer’s output, an adaptive form of pruning. In the ideal case, all active neurons are selected and inactive neurons are ignored for each input. However, current methods either require training or rely on ReLU activation functions. Our method has the best of both worlds: it is training-free and effective on a variety of activation functions.
This section contains an overview of different components of the FF block followed by a more detailed introduction of the MoE problem which our method aims to tackle. Since FF blocks operate identically and independently for each token unlike attention, we begin with defining the FF block with a single column vector input \(\boldsymbol{x}\in \mathbb{R}^{D}\): \[\begin{align} \text{FF} (\boldsymbol{x}) = \text{FF}_2(\underbrace{\text{FF}_1(\boldsymbol{x})}_{\boldsymbol{z}}) \end{align}\] where \(\text{FF}_2 (\boldsymbol{z}) = \boldsymbol{W}_2 \boldsymbol{z}+ \boldsymbol{b}_2\) is a linear transformation and \(\text{FF}_1\) is nonlinear. This describes a wide range of FF architectures and arbitrary activation functions \(\sigma\). For instance, in OPT, \[\begin{align} \text{FF}_1(\boldsymbol{x}) &= \sigma(\boldsymbol{W}_1 \boldsymbol{x}+ \boldsymbol{b}_1). \end{align}\] For FF blocks with GLU variants such as in Llama 2 and Gemma, \[\begin{align} \text{FF}_1(\boldsymbol{x}) &= \sigma(\boldsymbol{W}_g \boldsymbol{x}+ \boldsymbol{b}_g) \odot (\boldsymbol{W}_1 \boldsymbol{x}+ \boldsymbol{b}_1) \end{align}\] where \(\odot\) signifies element-wise multiplication. For all examples, \(\boldsymbol{W}_1, \boldsymbol{W}_g \in \mathbb{R}^{D_{\text{FF}} \times D}\) and \(\boldsymbol{W}_2 \in \mathbb{R}^{D \times D_{\text{FF}}}\) where typically, \(D_{\text{FF}} \gg D\). We refer to \(\boldsymbol{z}= \text{FF}_1(\boldsymbol{x})\) as the FF activations. The goal of MoEs is to find \(\widehat{\boldsymbol{W}}_1 \in \mathbb{R}^{k \times D}\), \(\widehat{\boldsymbol{b}}_1 \in \mathbb{R}^{k}\), and \(\widehat{\boldsymbol{W}}_2 \in \mathbb{R}^{D \times k}\) (additionally \(\widehat{\boldsymbol{W}}_g \in \mathbb{R}^{k \times D}\) and \(\widehat{\boldsymbol{b}}_g \in \mathbb{R}^{k}\) if needed) where \(k < D_{\text{FF}}\) such that when the FF block is reparameterized with these matrices, the output value is preserved. 
In other words, for \[\begin{align} \widehat{\boldsymbol{z}} = \widehat{\text{FF}}_1(\boldsymbol{x}) &= \sigma(\widehat{\boldsymbol{W}}_g \boldsymbol{x}+ \widehat{\boldsymbol{b}}_g) \odot (\widehat{\boldsymbol{W}}_1 \boldsymbol{x}+ \widehat{\boldsymbol{b}}_1), \\ \widehat{\text{FF}}_2(\widehat{\boldsymbol{z}}) &= \widehat{\boldsymbol{W}}_2 \widehat{\boldsymbol{z}} + \boldsymbol{b}_2, \end{align}\] we seek \(\text{FF}(\boldsymbol{x}) \approx \widehat{\text{FF}}_2(\widehat{\text{FF}}_1(\boldsymbol{x}))\), and similarly for FF blocks with non-GLU functions. In the MoE setting, these smaller matrices can vary from token to token and are actually selections of rows and columns of the original matrices. Crucially, this selection leads to multiplication with smaller matrices which are naturally efficient on GPUs and TPUs [45], [46]. All equations defined in this section operate independently on each row of a length \(S\) sequence input \(\boldsymbol{X}\in \mathbb{R}^{S \times D}\) (e.g. the activations for a sequence are \(\boldsymbol{Z}= \text{FF}_1(\boldsymbol{X}) \in \mathbb{R}^{S \times D_{\text{FF}}}\)).
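To make the reparameterization concrete, the following NumPy sketch builds a small GLU-style FF block and selects \(k\) neurons by slicing rows of \(\boldsymbol{W}_g\), \(\boldsymbol{W}_1\) and columns of \(\boldsymbol{W}_2\). The toy dimensions, random weights, SiLU activation, and the per-token oracle choice of neurons are all illustrative assumptions rather than settings from the experiments. It checks that computing with the smaller matrices gives the same output as zeroing the unselected entries of \(\boldsymbol{z}\) in the full block.

```python
import numpy as np

def silu(u):
    # SiLU activation; stands in for the sigma of a SwiGLU block (assumption).
    return u / (1.0 + np.exp(-u))

def ff1_glu(x, W_g, b_g, W_1, b_1):
    # FF activations z = sigma(W_g x + b_g) * (W_1 x + b_1).
    return silu(W_g @ x + b_g) * (W_1 @ x + b_1)

def select_experts(idx, W_g, b_g, W_1, b_1, W_2):
    # Keep rows of W_g and W_1 (and matching bias entries) and columns of W_2.
    return W_g[idx], b_g[idx], W_1[idx], b_1[idx], W_2[:, idx]

rng = np.random.default_rng(0)
D, D_FF, k = 8, 32, 16  # toy sizes, chosen only for illustration
W_g, W_1 = rng.standard_normal((2, D_FF, D))
b_g, b_1 = rng.standard_normal((2, D_FF))
W_2 = rng.standard_normal((D, D_FF))
b_2 = rng.standard_normal(D)
x = rng.standard_normal(D)

z = ff1_glu(x, W_g, b_g, W_1, b_1)
idx = np.sort(np.argsort(-np.abs(z))[:k])  # oracle: k largest |z| for this token

Wg_h, bg_h, W1_h, b1_h, W2_h = select_experts(idx, W_g, b_g, W_1, b_1, W_2)
y_hat = W2_h @ ff1_glu(x, Wg_h, bg_h, W1_h, b1_h) + b_2

# Same result as masking the unselected entries of z in the full block.
z_masked = np.zeros_like(z)
z_masked[idx] = z[idx]
y_masked = W_2 @ z_masked + b_2
print(np.allclose(y_hat, y_masked))
```

The slicing form is the one that saves compute, since the matrix products involve \(k\) rather than \(D_{\text{FF}}\) neurons.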
Here, we take a deeper dive into the phenomenon of flocking and describe the intuitive algorithm of GRIFFIN which is directly inspired by it.
Flocking arises when we look at the relative impact of each neuron per token within a sequence. To see this, we normalize rows of \(\boldsymbol{Z}\) to be unit vectors to construct \(\overline{\boldsymbol{Z}} \in \mathbb{R}^{S \times D_{\text{FF}}}\) (i.e. \([\overline{\boldsymbol{Z}}]_i = [\boldsymbol{Z}]_i / \|[\boldsymbol{Z}]_i\|_2\)), the relative activations. We show example relative activation magnitudes for a sequence in Llama 2 7B and Gemma 7B in Figure 1. Since there are distinct vertical streaks, this intriguingly implies that activations that have relatively greater weight are common across all tokens in a sequence. Notably, Llama 2 7B and Gemma 7B use SwiGLU and GEGLU activations, respectively, along with other major architecture differences. We call this phenomenon flocking, like highly organized groups of birds, and we observe this in virtually all FF layers (see Appendix 8).
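As a small illustration of the row normalization, the sketch below constructs synthetic activations whose tokens differ greatly in absolute scale but share the same dominant neurons. The data, the sizes, and the `eps` guard against all-zero rows are assumptions of this sketch. After normalization every row has unit norm and the same neurons stand out for every token, which is the pattern the vertical streaks reflect.

```python
import numpy as np

def relative_activations(Z, eps=1e-12):
    # Divide each token's activation row by its l2 norm.
    # eps guards against all-zero rows (an assumption of this sketch).
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, eps)

rng = np.random.default_rng(0)
S, D_FF = 4, 16
pattern = np.zeros(D_FF)
pattern[[3, 7, 11]] = [5.0, -4.0, 3.0]        # neurons shared by all tokens
scales = np.array([0.1, 1.0, 10.0, 100.0])    # very different absolute magnitudes
Z = scales[:, None] * (pattern + 0.01 * rng.standard_normal((S, D_FF)))

Z_bar = relative_activations(Z)
# Indices of the three largest relative magnitudes for each token.
top3 = np.sort(np.argsort(-np.abs(Z_bar), axis=1)[:, :3], axis=1)
print(np.linalg.norm(Z_bar, axis=1))
print(top3)
```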
While relative activation magnitudes are shared within a sequence, they are not generally shared between sequences. We show this by taking the \(\ell_2\)-norm of \(\overline{\boldsymbol{Z}}\) along the token axis to obtain a length \(D_{\text{FF}}\) vector for each sample or sequence, roughly capturing the contribution of each FF neuron throughout a sequence. Taking the top-\(k\) of this for each sample at each layer, we find the Jaccard similarity between two sequences based on the indices selected for different \(k\). In other words, we compute the intersection over union of every unique pair of top-\(k\) sets. Higher values indicate more similar top-\(k\) sets. From Figure 2 where we aggregate Jaccard similarities across WikiText [47] samples, we observe a lack of inter-sample activation similarities for the vast majority of layers in Llama 2 7B and Gemma 7B, unless the sets of selected neurons are large. This lack of consistency implies pruning entire FF neurons without retraining would be less effective than a more adaptive method.
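The similarity measurement described above can be sketched as follows. The Gaussian activations are placeholders rather than activations from a real model, and the helper names are hypothetical. The last value of \(k\) shows why similarity necessarily approaches 1 when the selected sets are large: at \(k = D_{\text{FF}}\) both sets contain every neuron.

```python
import numpy as np

def neuron_scores(Z):
    # l2 norm of the row-normalized activations along the token axis.
    Z_bar = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return np.linalg.norm(Z_bar, axis=0)

def top_k_set(scores, k):
    # Indices of the k largest scores, as a Python set.
    return set(np.argsort(-scores)[:k].tolist())

def jaccard(a, b):
    # Intersection over union of two non-empty index sets.
    return len(a & b) / len(a | b)

rng = np.random.default_rng(0)
S, D_FF = 6, 64
Z_a = rng.standard_normal((S, D_FF))  # placeholder for sequence A
Z_b = rng.standard_normal((S, D_FF))  # placeholder for sequence B
s_a, s_b = neuron_scores(Z_a), neuron_scores(Z_b)

for k in (8, 32, D_FF):
    print(k, jaccard(top_k_set(s_a, k), top_k_set(s_b, k)))
```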
Using our insight on flocking, we introduce GRIFFIN as a simple, general-purpose, and training-free MoE method for efficient generation, captured in Figure 3. In a nutshell, we select experts during the prompt phase of each sample which are then used for the entire duration of the generation phase. This effective approach is based on a key observation on flocking: since tokens within a sequence share activation patterns, the prompt and generated tokens will also share activation patterns.
Our experts or neurons are chosen at the sequence level, so we need to consider the dynamics of the entire input sequence rather than just a single token when choosing our experts. To select expert neurons, we need a statistic \(\boldsymbol{s}\in \mathbb{R}^{D_{\text{FF}}}\) to inform us of the importance of each neuron. At the prompt phase, we do this by taking the \(\ell_2\)-norm of \(\overline{\boldsymbol{Z}}\) along the token axis: \[\begin{align} \boldsymbol{s}&= \begin{bmatrix} \|[\overline{\boldsymbol{Z}}]_{\cdot, 1} \|_2 &\cdots &\|[\overline{\boldsymbol{Z}}]_{\cdot, D_{\text{FF}}} \|_2 \end{bmatrix}^\top. \label{eq:s} \tag{1} \end{align}\] Taking the indices of the top-\(k\) across \(\boldsymbol{s}\) gives us the neurons we will use for this sample’s generation phase which make up the set \(\mathcal{E}\). Using the experts in \(\mathcal{E}\), we can find \(\widehat{\boldsymbol{W}}_1, \widehat{\boldsymbol{b}}_1, \widehat{\boldsymbol{W}}_g, \widehat{\boldsymbol{b}}_g, \text{ and } \widehat{\boldsymbol{W}}_2\) by selecting corresponding rows and columns in \(\boldsymbol{W}_1, \boldsymbol{b}_1, \boldsymbol{W}_g, \boldsymbol{b}_g\), and \(\boldsymbol{W}_2\), respectively. This is done for every FF block during the prompt phase. As elaborated in Appendix 7, \(\boldsymbol{s}\) highlights neurons consistently activated at relatively high intensities.
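A minimal sketch of this selection step for one FF block is given below. The function name `griffin_select`, the hand-written activation matrix, and the random weights are hypothetical and only serve to show the shapes involved. Because each row of \(\overline{\boldsymbol{Z}}\) has unit norm, the squared entries of \(\boldsymbol{s}\) sum to the prompt length \(S\), which gives a quick sanity check.

```python
import numpy as np

def griffin_select(Z, k):
    # Z: (S, D_FF) prompt activations of one FF block.
    Z_bar = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # relative activations
    s = np.linalg.norm(Z_bar, axis=0)                     # statistic s, shape (D_FF,)
    experts = np.sort(np.argpartition(-s, k - 1)[:k])     # indices of the top-k entries
    return s, experts

# Toy prompt of 4 tokens and 8 neurons; neurons 2 and 5 dominate every token.
Z = np.array([
    [0.1, 0.0, 3.0, 0.2, 0.0, -2.0, 0.1, 0.0],
    [0.0, 0.1, 6.0, 0.0, 0.3, -5.0, 0.0, 0.2],
    [0.2, 0.0, 0.9, 0.1, 0.0, -1.1, 0.0, 0.1],
    [0.0, 0.3, 4.0, 0.0, 0.1, -4.5, 0.2, 0.0],
])
s, experts = griffin_select(Z, k=2)
print(experts)  # [2 5]

# Expert weights are rows of W_1 (and W_g) and columns of W_2.
rng = np.random.default_rng(0)
D, D_FF = 3, Z.shape[1]
W_1 = rng.standard_normal((D_FF, D))
W_2 = rng.standard_normal((D, D_FF))
W1_hat, W2_hat = W_1[experts], W_2[:, experts]
print(W1_hat.shape, W2_hat.shape)
```

`np.argpartition` avoids a full sort; ties among equal scores are broken arbitrarily in this sketch.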
When generating tokens, we directly use the pruned layers which contain the expert neurons, \(\widehat{\text{FF}}_1\) and \(\widehat{\text{FF}}_2\), to estimate \(\text{FF}(\boldsymbol{X}) \approx \widehat{\text{FF}}_2(\widehat{\text{FF}}_1(\boldsymbol{X}))\) for all future tokens. In Llama 2 13B and Gemma 7B, this reduces the active number of parameters from 13B to 8.8B and from 8.5B to 5.4B, respectively, during generation.
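The active-parameter figure for Llama 2 13B can be checked with a short calculation. The dimensions below (hidden size 5120, FF width 13824, 40 layers, three bias-free FF matrices per layer) are taken from the publicly released model configuration and are assumptions with respect to this text, as is the rounded total of 13B parameters.

```python
# Assumed Llama 2 13B configuration (public model config, not stated in the text).
D, D_FF, n_layers = 5120, 13824, 40
total_params = 13.0e9          # rounded total, as quoted in the text

# Each FF block holds W_g, W_1 (both D_FF x D) and W_2 (D x D_FF), without biases.
ff_params = 3 * D * D_FF * n_layers
removed = 0.5 * ff_params      # 50% FF sparsity during generation
active = total_params - removed

print(f"FF parameters:     {ff_params / 1e9:.2f}B")
print(f"FF share of model: {ff_params / total_params:.0%}")
print(f"active parameters: {active / 1e9:.1f}B")
```

Under these assumptions the FF blocks hold about 8.5B parameters, roughly two-thirds of the model, and halving them leaves about 8.8B active parameters.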
We showcase the superb performance of GRIFFIN on numerous tasks and models (Section 5.1) while achieving lower latency (Section 5.2), along with a study on several of its properties like sampling experts, sequence length scaling, and random inputs (Section 5.3). All experiments are run on NVIDIA L40 GPUs.
Model | HellaSwag | PIQA | COPA | ARC-e | ARC-c | BoolQ |
---|---|---|---|---|---|---|
Llama 2 7B | 57.16 | 78.07 | 87.00 | 76.35 | 43.34 | 77.71 |
Magnitude | 57.12 | 77.31 | 84.00 | 70.33 | 40.27 | 66.54 |
GRIFFIN | 57.11 | 77.69 | 86.00 | 74.54 | 42.75 | 73.15 |
Llama 2 13B | 60.06 | 79.05 | 90.00 | 79.46 | 48.46 | 80.61 |
Magnitude | 60.00 | 79.00 | 88.00 | 74.07 | 46.25 | 70.52 |
GRIFFIN | 60.10 | 79.11 | 89.00 | 77.19 | 46.84 | 78.50 |
Gemma 7B | 60.61 | 80.30 | 88.00 | 82.74 | 50.09 | 83.49 |
Magnitude | 46.24 | 73.12 | 57.00 | 45.20 | 32.76 | 62.84 |
GRIFFIN | 60.62 | 79.98 | 88.00 | 81.65 | 50.09 | 81.90 |
Mistral 7B | 61.21 | 80.58 | 92.00 | 80.89 | 50.43 | 83.61 |
Magnitude | 61.15 | 80.36 | 86.00 | 74.20 | 48.89 | 60.40 |
GRIFFIN | 61.18 | 80.52 | 91.00 | 79.25 | 50.00 | 80.06 |
OPT 6.7B | 50.48 | 76.28 | 81.00 | 65.53 | 30.55 | 66.12 |
Magnitude | 49.21 | 72.63 | 79.00 | 47.60 | 27.13 | 40.15 |
GRIFFIN | 50.44 | 75.63 | 80.00 | 63.93 | 30.55 | 65.44 |
OPT 13B | 52.46 | 75.90 | 86.00 | 67.05 | 32.94 | 65.81 |
Magnitude | 51.31 | 74.21 | 81.00 | 49.41 | 28.07 | 38.75 |
GRIFFIN | 52.42 | 76.17 | 86.00 | 66.92 | 33.19 | 67.65 |
Using various models, we evaluate on several generation and classification tasks. For generation, we evaluate on XSum [48], CNN/DailyMail [49], CoQA [50], and SCROLLS QASPER [51], [52]. For classification, we evaluate on HellaSwag [53], PIQA [54], COPA [55], ARC-Easy/Challenge [56], and BoolQ [57]. With the exception of XSum and CNN/DailyMail, we use LM Evaluation Harness for our experiments [58]. Aside from comparing with the original LLM, we also compare GRIFFIN with a static sequence-level MoE based on neuron magnitudes. Similar to neuron magnitude pruning, this baseline selects experts based on neuron magnitudes in \(\boldsymbol{W}_1\) for the generation phase but uses the entire FF blocks for prompting like GRIFFIN. In the case of GLU variants, the neuron-wise norms of \(\boldsymbol{W}_1\) and \(\boldsymbol{W}_g\) are elementwise multiplied to produce the pruning metric. As we will see, this straightforward baseline achieves great classification results but falters for generation.
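The magnitude baseline can be sketched as below. The text does not specify which norm is used, so the \(\ell_2\)-norm of each neuron's weight row is an assumption here, and the function names and toy weights are hypothetical. Unlike GRIFFIN, the resulting expert set depends only on the weights and is therefore identical for every input.

```python
import numpy as np

def magnitude_metric(W_1, W_g=None):
    # One score per FF neuron: norm of its row in W_1 (l2 norm assumed).
    metric = np.linalg.norm(W_1, axis=1)
    if W_g is not None:
        # GLU variants: multiply elementwise by the row norms of W_g.
        metric = metric * np.linalg.norm(W_g, axis=1)
    return metric

def magnitude_experts(W_1, k, W_g=None):
    # Static top-k neurons, independent of the prompt.
    return np.sort(np.argsort(-magnitude_metric(W_1, W_g))[:k])

# Toy weights for 4 neurons and D = 2.
W_1 = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [6.0, 8.0]])   # row norms 5, 1, 2, 10
W_g = np.array([[1.0, 0.0], [0.0, 10.0], [2.0, 0.0], [0.0, 0.1]])  # row norms 1, 10, 2, 0.1

print(magnitude_experts(W_1, k=2))           # [0 3]
print(magnitude_experts(W_1, k=2, W_g=W_g))  # [0 1]
```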
As our method is designed specifically for generation, we alter classification evaluations to simulate generation. In typical classification tasks, LLMs do not enter the generative phase since the final token output of the prompt phase indicates the class. Consequently, directly applying GRIFFIN for classification tasks trivially yields the exact performance of the original model. Therefore, we treat all tokens but the final input token as the prompt. Then, the model is forced to go into the generation phase for one step to produce the class.
We start with a look into the relationship between the sparsity levels and performance degradation. This translates to varying \(k\) when we select the top-\(k\) of our statistic \(\boldsymbol{s}\). To compare the performance degradation across multiple tasks, we plot the ratio of the final performance metrics between GRIFFIN and the full model in Figure 4. We see most of the performance is preserved at 50% FF sparsity in Llama 2 7B, Gemma 7B, and Mistral 7B. Different tasks have different tipping points where performance sharply drops, which may be related to the difficulty of the task [59].
Model | XSum (Rouge-1/2/L) | CNN/DailyMail (Rouge-1/2/L) | CoQA (F1/EM) | QASPER (F1) |
---|---|---|---|---|
Llama 2 7B | 27.15/9.06/22.62 | 10.08/0.13/9.55 | 77.35/63.88 | 26.31 |
Magnitude | 9.71/1.31/8.59 | 9.66/0.63/9.32 | 56.59/39.93 | 12.93 |
GRIFFIN | 24.75/7.41/20.55 | 10.97/0.66/10.37 | 77.18/63.58 | 25.76 |
Llama 2 13B | 26.90/9.45/22.09 | 2.51/0.22/2.34 | 79.18/66.37 | 28.32 |
Magnitude | 5.72/0.78/5.06 | 0.02/0.00/0.02 | 65.69/47.87 | 15.55 |
GRIFFIN | 25.69/7.85/20.89 | 3.31/0.78/3.07 | 79.22/66.62 | 27.91 |
Gemma 7B | 26.86/9.15/22.03 | 17.45/4.15/15.94 | 79.04/65.25 | 30.78 |
Magnitude | 1.49/0.01/1.47 | 0.00/0.00/0.00 | 2.92/1.50 | 7.02 |
GRIFFIN | 25.86/7.81/20.93 | 18.26/4.75/16.58 | 78.52/64.62 | 27.37 |
Mistral 7B | 28.67/10.21/23.64 | 0.28/0.01/0.28 | 80.70/67.30 | 24.56 |
Magnitude | 3.58/0.27/3.31 | 0.26/0.03/0.26 | 61.99/45.93 | 17.18 |
GRIFFIN | 26.59/8.70/22.17 | 1.26/0.21/1.17 | 80.15/66.50 | 23.92 |
OPT 6.7B | 23.60/7.04/19.46 | 13.85/1.54/13.04 | 68.70/54.98 | 18.53 |
Magnitude | 1.63/0.00/1.54 | 1.20/0.00/1.17 | 31.53/16.52 | 7.28 |
GRIFFIN | 21.17/5.42/17.58 | 13.01/1.06/12.26 | 68.99/55.00 | 17.40 |
OPT 13B | 25.14/7.93/20.80 | 13.22/1.18/12.46 | 69.51/55.67 | 20.58 |
Magnitude | 1.23/0.00/1.21 | 1.29/0.00/1.29 | 39.38/27.07 | 8.87 |
GRIFFIN | 22.11/6.28/18.29 | 12.92/1.13/12.20 | 69.07/54.83 | 20.16 |
ReluLlama 2 7B | 25.10/7.81/20.76 | 20.95/6.79/19.24 | 78.49/66.73 | 23.31 |
Magnitude | 9.09/0.22/8.20 | 8.50/0.14/8.17 | 19.43/6.48 | 7.21 |
GRIFFIN | 21.83/5.88/18.09 | 16.85/4.96/14.69 | 78.35/67.10 | 22.29 |
Fixing FF sparsity to be 50%, we evaluate on more tasks and models. Table 1 and Table 2 show the performance of GRIFFIN on classification and generation, respectively. We see that magnitude neuron pruning achieves reasonable results for classification but completely annihilates the original performance in most generation settings. In contrast, GRIFFIN not only outperforms the baseline in most scenarios but also preserves most of, or matches, the original performance on all tasks.
We now present efficiency metrics of GRIFFIN. We collect synthetic datasets with samples having identical lengths and average results across samples. Like many other MoE methods, GRIFFIN is ideal for single sample inputs, such as in the case of personal devices, so we use batch size 1 for these experiments. We plan to extend our method to larger batch sizes in future work. Using Hugging Face implementations of Llama 2 13B and Gemma 7B at FP16 precision, we measure the latency in different scenarios on an NVIDIA L40 GPU.
Model | Setup | Prompt | Full | Magnitude | GRIFFIN |
---|---|---|---|---|---|
Llama 2 13B | 2048+128 | 0.5 | 6.8 | 5.4 / 5.0 | 5.4 / 5.1 |
Llama 2 13B | 2048+2048 | 0.5 | 119.1 | 95.0 / 83.4 | 94.9 / 82.8 |
Gemma 7B | 2048+128 | 0.3 | 4.5 | 4.1 / 4.2 | 4.2 / 4.1 |
Gemma 7B | 2048+2048 | 0.3 | 78.5 | 67.7 / 65.0 | 67.4 / 65.0 |
Recalling that our magnitude selection baseline is essentially neuron pruning at generation, this has the best possible speed-up since there is no MoE overhead per sample. From Table 3, GRIFFIN matches the best case, producing up to a 1.16\(\times\) and 1.25\(\times\) improvement in latency for long generation at 50% FF sparsity in Gemma 7B and Llama 2 13B, respectively. This illustrates that our method is as fast as a static neuron pruned LLM during generation while being adaptive to preserve the accuracy of the full model. In offloading settings with large models, our method has the potential to further accelerate inference. For a prompt, GRIFFIN essentially performs structured pruning on the massive network, and if this pruned model can fit on a single device, it will avoid offloading for the entirety of generation.
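The quoted speed-ups follow from the long-generation (2048+2048) rows of Table 3. The sketch below assumes that the first value in each "a / b" cell corresponds to 50% FF sparsity, which is the reading that reproduces the ratios stated above.

```python
# Generation latencies in seconds for the 2048+2048 setup, copied from Table 3.
# Assumption: the first number in each GRIFFIN cell is the 50% FF sparsity case.
latency = {
    "Llama 2 13B": {"full": 119.1, "griffin": 94.9},
    "Gemma 7B": {"full": 78.5, "griffin": 67.4},
}

# Speed-up is the ratio of full-model latency to GRIFFIN latency.
speedup = {model: t["full"] / t["griffin"] for model, t in latency.items()}
for model, ratio in speedup.items():
    print(f"{model}: {ratio:.3f}x")
```

Both ratios agree with the reported figures of about 1.25\(\times\) and 1.16\(\times\) up to rounding of the tabulated latencies.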
Here, we verify that given the statistic \(\boldsymbol{s}\), top-\(k\) expert selection produces better results than sampling-based methods. The methods we compare against include sampling based on the weights in \(\boldsymbol{s}\) and combining top-\(k\) selection for half of the experts followed by weighted sampling. Based on Table 4, we can see that sampling generally degrades performance much more.
Selection Method | XSum (Rouge-1/2/L) | CNN/DailyMail (Rouge-1/2/L) | CoQA (F1/EM) | QASPER (F1) |
---|---|---|---|---|
Llama 2 7B | | | | |
Full | 27.15/9.06/22.62 | 10.08/0.13/9.55 | 77.35/63.88 | 26.31 |
Top-\(k\) | 24.75/7.41/20.55 | 10.97/0.66/10.37 | 77.18/63.58 | 25.76 |
Sampling | 21.04/5.22/17.12 | 8.78/0.49/8.28 | 76.15/62.53 | 24.46 |
Top-\(k\) + Sampling | 23.13/6.19/19.13 | 10.01/0.47/9.50 | 76.66/62.93 | 25.07 |
Gemma 7B | | | | |
Full | 26.86/9.15/22.03 | 17.45/4.15/15.94 | 79.04/65.25 | 30.78 |
Top-\(k\) | 25.86/7.81/20.93 | 18.26/4.75/16.58 | 78.52/64.62 | 27.37 |
Sampling | 20.25/5.16/16.79 | 8.34/1.71/7.72 | 75.02/59.93 | 24.97 |
Top-\(k\) + Sampling | 22.71/6.49/18.56 | 8.14/1.68/7.52 | 75.17/60.58 | 25.88 |
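The three selection rules compared in Table 4 could be implemented along the following lines. This sketch assumes that weighted sampling draws neurons without replacement with probability proportional to the entries of \(\boldsymbol{s}\), and that the hybrid rule samples its second half from the neurons not already chosen; neither detail is specified in the text, and the function names are hypothetical.

```python
import numpy as np

def select_top_k(s, k):
    # Deterministic: the k largest entries of s.
    return np.sort(np.argsort(-s)[:k])

def select_sampling(s, k, rng):
    # Draw k distinct neurons with probability proportional to s (assumption).
    p = s / s.sum()
    return np.sort(rng.choice(len(s), size=k, replace=False, p=p))

def select_top_k_plus_sampling(s, k, rng):
    # Top-k for half of the experts, weighted sampling for the remainder.
    top = np.argsort(-s)[: k // 2]
    rest = np.setdiff1d(np.arange(len(s)), top)
    p = s[rest] / s[rest].sum()
    sampled = rng.choice(rest, size=k - len(top), replace=False, p=p)
    return np.sort(np.concatenate([top, sampled]))

rng = np.random.default_rng(0)
D_FF, k = 32, 16
s = rng.random(D_FF) + 0.01  # placeholder positive statistic

print(select_top_k(s, k))
print(select_sampling(s, k, rng))
print(select_top_k_plus_sampling(s, k, rng))
```

Sampling without replacement requires at least \(k\) neurons with nonzero weight, which holds for the strictly positive placeholder used here.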
We find that GRIFFIN can potentially be made more robust for long generation by lengthening the prompt. To see this, we use language modeling on the concatenated version of WikiText to simulate generation. For a length \(S\) input into the FF block, we designate the first \(P\) tokens as the prompt and the last \(G\) tokens as the generated portion such that \(P + G = S\). The prompt partition is used to calculate our statistic \(\boldsymbol{s}\) and determine the experts. The prompt partition uses the full FF block while the generation partition only uses the selected experts. When comparing the original model with GRIFFIN, we only compute the perplexity of the outputs from the generation partition since the other outputs will be identical. Based on Figure 5, GRIFFIN gets closer to the full model outputs when the prompt length increases and generation length decreases, meaning the difficulty with long generation can be suppressed with longer prompts.
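For a single FF block, the prompt and generation partitioning can be sketched as follows. This is a simplified illustration with an OPT-style ReLU block, random weights, and random inputs, all of which are assumptions; in a full model the inputs to later layers would also depend on the pruned earlier layers, and perplexity would be computed from the final logits. The function name and the `eps` guard are hypothetical.

```python
import numpy as np

def ff1(X, W_1, b_1):
    # OPT-style FF_1 applied row-wise: ReLU(X W_1^T + b_1).
    return np.maximum(X @ W_1.T + b_1, 0.0)

def griffin_partitioned_ff(X, P, k, W_1, b_1, W_2, b_2, eps=1e-12):
    # Prompt partition (first P tokens): full FF block, and compute s.
    Z_prompt = ff1(X[:P], W_1, b_1)
    norms = np.maximum(np.linalg.norm(Z_prompt, axis=1, keepdims=True), eps)
    s = np.linalg.norm(Z_prompt / norms, axis=0)
    experts = np.sort(np.argsort(-s)[:k])
    out_prompt = Z_prompt @ W_2.T + b_2
    # Generation partition (last G = S - P tokens): selected experts only.
    Z_gen = ff1(X[P:], W_1[experts], b_1[experts])
    out_gen = Z_gen @ W_2[:, experts].T + b_2
    return np.vstack([out_prompt, out_gen]), experts

rng = np.random.default_rng(0)
S, P, D, D_FF = 12, 8, 6, 24
X = rng.standard_normal((S, D))
W_1 = rng.standard_normal((D_FF, D)); b_1 = rng.standard_normal(D_FF)
W_2 = rng.standard_normal((D, D_FF)); b_2 = rng.standard_normal(D)

full = ff1(X, W_1, b_1) @ W_2.T + b_2
approx, experts = griffin_partitioned_ff(X, P, D_FF // 2, W_1, b_1, W_2, b_2)

# Prompt outputs are identical; only the generation partition can differ.
print(np.allclose(approx[:P], full[:P]))
print(np.abs(approx[P:] - full[P:]).mean())
```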
As further exploration into flocking, we investigate this phenomenon with random inputs. As input sequences, we use a sample from concatenated WikiText, a permuted version of that sample, and a completely random sequence where tokens are uniformly sampled from the vocabulary. As seen in Figure 6, this structure exists in permuted and random inputs, perhaps even more consistently than in unperturbed sequences. This suggests something within language actually diversifies the activations, the cause of which would be of interest for future work.
In this work, we have shown a special form of sparsity in FF layers and a simple method to exploit it. Flocking is a curious phenomenon present in many LLMs where tokens within a sequence activate the same FF neurons at similar relative intensities. This structure motivated the design of GRIFFIN, a learning-free MoE selection mechanism to remove FF neurons during inference at the sequence level which preserves the full model’s performance on a large collection of classification and generative tasks at 50% FF sparsity while achieving lower latency. Furthermore, its applicability extends beyond just ReLU-based LLMs, allowing MoE adaptation to be possible for many more models. With its straightforward algorithm and no-cost deployment, GRIFFIN expands the accessibility of numerous LLMs for generative inference.
We thank Zixin Wen for insightful discussions. The work of H. Dong is supported in part by the Wei Shen and Xuehong Zhang Presidential Fellowship at Carnegie Mellon University. The work of Y. Chi is supported in part by the grants NSF DMS-2134080 and ONR N00014-19-1-2404.
Here we present visualizations of our gating statistic \(\boldsymbol{s}\) from Equation (1). For a single sample, we find \(\boldsymbol{s}\) and sort the entries normalized between 0 and 1 in Figure 7. In both models, values in \(\boldsymbol{s}\) are heavily concentrated in a handful of features. Since \(\boldsymbol{s}\) aggregates the relative activation magnitudes across tokens, this implies \(\boldsymbol{s}\) can capture heavily and frequently activated neurons.
We provide more examples of flocking across different layers of the LLM. Figure 8 and Figure 9 show flocking in Gemma 7B. Figure 10 and Figure 11 show flocking in Llama 2 7B. Flocking in Gemma 7B is more visually distinct while activations in Llama 2 7B are more distributed.
Department of Electrical and Computer Engineering, Carnegie Mellon University, USA; Emails: {harryd,beidic,yuejiec}@andrew.cmu.edu