A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs


Abstract

Transformer neural networks (TNNs) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource-constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing a custom accelerator for each model is complex and time-intensive. Some custom accelerators exist, but they lack runtime adaptability and often rely on sparse matrices to reduce latency; sparsity, however, makes the hardware design more challenging because application-specific sparsity patterns must be supported. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR improves the utilization of processing elements and on-chip memory, increasing parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like the VC707 and ZCU102 show that our design is 1.2\(\times\) and 2.87\(\times\) more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU, respectively. Additionally, it achieves a speedup of 1.7\(\times\) to 2.25\(\times\) compared to some state-of-the-art FPGA-based accelerators.

© 2024 ACM. Personal use of this material is permitted. Permission from ACM must be obtained for all other uses in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
This work has been submitted to ACM Transactions on Reconfigurable Technology and Systems journal.


1 Introduction

Transformer neural networks (TNNs) have shown great performance in natural language processing (NLP) [1], machine translation [2], computer vision [3], and other fields in recent years. While recurrent neural network (RNN) [4] and long short-term memory (LSTM) [5] models run sequential computation tasks during both training and inference, the transformer facilitates a high degree of computational parallelism throughout both processes using an attention mechanism. Thus, TNNs are becoming a potential alternative to CNNs, RNNs, and LSTMs [6], [7]. There are many transformer models, such as the full transformer containing both encoder and decoder [8], BERT [9], [10], ALBERT [11], structBERT [12], and others. These models contain different numbers of encoder and decoder stacks [8] for different applications.

A single encoder often requires a latency on the order of hundreds of microseconds [13]. Around 38% to 64% of this time is spent in the multi-head attention (MHA) mechanism, depending on the number of tokens in the input sequence [14], [15], and the rest is spent in the feed-forward network (FFN). Unfortunately, general-purpose platforms like GPUs and CPUs often suffer from low computational efficiency, underutilized memory bandwidth, and substantial compilation overheads for MHA layers [16]. MHA and FFN also occupy most of the on-chip storage units [17]–[19]. Therefore, it is essential to prioritize efficient hardware deployment on resource-constrained devices. FPGAs have gained widespread use for accelerating DNNs due to their high level of parallelism, high energy efficiency, and low latency [20], [21].

Recently, some works have successfully built FPGA-based custom hardware accelerators for transformers [13], [18], [22]. Application-specific integrated circuit (ASIC)-based accelerators also exist [23]. Most of these works compress the model using different weight pruning strategies to accelerate computations and, as a result, must incorporate specialized hardware to support sparse tensor access. However, the benefit of supporting sparse tensors does not offset the added cost of this hardware. Lu et al. [22] accelerated the attention mechanism and the feed-forward network separately, but did not implement the full transformer encoder. Ye et al. focused on accelerating only the attention mechanism using a reconfigurable systolic array for the transformer. Similarly, Zhang et al. [24] concentrated on accelerating the attention layer through hardware-software co-design. Some other works accelerate full transformer networks, but their logic circuits must go through the time-consuming synthesis steps for different models, or they perform poorly on the same model with different configurations [25]; they lack the generality to support other variants. As transformer variants continue to evolve with differing parameters, designing a generic and efficient accelerator that can be customized to the structural characteristics of these variants becomes increasingly valuable. Thus, a versatile accelerator is needed that can efficiently handle dense matrix computations across various TNN applications.

Digital signal processing (DSP) resources are capable of high-speed computation at high frequencies, but exploiting them properly depends on the implementation method. For example, most accelerators [13], [26]–[28] used high-level synthesis (HLS) tools, while some used hardware description languages (HDL) [29]–[31] for design. While HLS requires less implementation time than HDL, writing efficient HLS code that uses parallel DSPs for optimal performance is challenging [32]. An additional challenge is storing the vast number of TNN parameters in the on-chip memories of FPGAs, which typically have a size of 5 MB for low-end devices such as the ZCU104 and 35 MB for high-end devices such as the Alveo U200 [33].

Another challenge is executing the extensive number of multiplication and accumulation (MAC) operations required by TNNs on the DSPs, with UltraScale+ FPGAs offering approximately 9024 DSPs. Therefore, input matrices must be partitioned into tiles. However, developing an optimal partitioning scheme that aligns well with the architecture presents a significant challenge. The data access and computation patterns also differ across the various blocks within the transformer, which further complicates acceleration. Therefore, assigning a dedicated hardware module to each block allows easier design and optimization.

In this paper, we utilized an HLS tool to design a programmable accelerator for transformers. Our HLS code is optimized to maximize the parallel usage of DSPs. The computations are handled by DSP48 slices, with the required data stored in BRAMs, LUTRAMs, or registers within the processing elements (PE). Our architecture features efficient tiling for both the attention mechanism and the linear transformations, enabling enhanced parallel computation and communication to maximize the acceleration of transformers.

The paper contributes in the following ways to the development of high-performance FPGA-based accelerators for transformer neural networks (TNNs), with an emphasis on improving computational efficiency, resource utilization, and adaptability across different hardware configurations:

  • A novel accelerator architecture for a complete transformer that maximizes DSP and LUT utilization to enhance parallel processing and achieve low latency.

  • An efficient tiling strategy for weight matrices in both the multi-head attention layer and the feedforward neural network layer, enabling the deployment of the accelerator to any FPGA platform for most TNN models.

  • An analytical model to estimate resource utilization and latency.

  • A modular design approach to account for different computation and data access patterns in different components of the TNN.

  • A parameterized HLS code that enables design-time adjustments, making it easier to modify the design.

  • A runtime-adaptive feature that allows dynamic adjustment of parameters from software, enabling execution of different models without any hardware re-synthesis.

  • The full source code 1 to reproduce the presented results or improve the design.

2 Background

2.1 Transformer Architecture

There are several building blocks in transformers, as shown in Fig. 1 (a). An input sequence of tokens is converted into embeddings. The positional encoder enables the model to consider the order of tokens in a sequence by adding positional information to the embeddings: it generates vectors that give each word context according to its position in a sentence. The vectors are then linearly transformed into three tensors, Q (queries), K (keys), and V (values), by multiplying the embedding matrix with three weight matrices. The encoder block handles these tensors, transforming them into a higher-level representation that encapsulates crucial information. This process ensures the proper capture of features and contextual relationships within the input sequence.

The encoder architecture comprises two main sub-layers: (1) the self-attention mechanism and (2) the position-wise feed-forward network. The self-attention mechanism enables the model to assess different segments of an input sequence simultaneously. It captures long-range relationships by measuring attention scores and utilizing multi-head projections for various input representations. Thus, it can learn complex patterns, dependencies, and relationships effectively. The position-wise feed-forward network (FFN), which is equivalent to a multilayer perceptron (MLP), applies linear transformations to every position independently in the input sequence. Two linear transformations are executed in this network, consisting mainly of matrix-vector multiplications. The first linear transformation is followed by an activation function such as the Rectified Linear Unit (ReLU) or Gaussian Error Linear Unit (GELU); the second is not. Furthermore, each sub-layer includes a residual connection combined with layer normalization (LN), which reduces the vanishing gradient problem during training. A residual addition and LN layer is inserted after each MHA and FFN; it mainly involves element-wise matrix addition and nonlinear functions.

The decoder block illustrated in Fig. 1 (a) is responsible for generating the output sequence based on the encoded representations supplied by the encoder. Like the encoder, the decoder consists of a stack of N identical layers. Each layer within the decoder contains three sub-layers: (1) the masked attention mechanism, which resembles the encoder's self-attention but includes a masking feature that restricts the output's dependency to known preceding outputs; (2) an attention layer that attends to the encoder's output, enabling the decoder to emphasize relevant sections of the input sequence for each output element; and (3) a position-wise feed-forward network.

Figure 1: Transformer Neural Network. (a) Complete architecture. (b) Multi-head attention layer.

The self-attention mechanism in transformers allows each position in the sequence to attend to all other positions, enabling the model to consider global context easily. Each attention head is composed of three linear layers and a scaled dot-product attention function. The parameter \(h\), the number of heads, is 8 in the Transformer base model and 16 in the Transformer big model. As illustrated in Fig. 1 (b), the scaled dot-product attention in each head is a crucial part of the multi-head attention layer. The attention weights are computed by performing the dot product of the query and key vectors and subsequently scaling it down by the square root of the dimension of the key vectors. This scaling is essential to prevent the dot products from becoming excessively large, which contributes to the stabilization of gradients during training. Subsequently, the scaled dot products undergo the softmax function, resulting in the attention weights. These weights are then used to perform a weighted sum of the value vectors. The ultimate output is the projection of the concatenated sequences from all heads.

The output of the MHA can be represented by Equations 1 and 2. The input sequence X is linearly mapped into the \(Q_i, K_i, V_i\) matrices using weights and biases. The parameter \(d_k = d_{model}/h\) is the dimension of \(Q_i\) and \(K_i\), \(d_{model}\) is a hyperparameter called the embedding dimension, and \(h\) is the number of heads. The mask operation filters out all values of illegal connections before the softmax. The parameter \(d_k\) is set to 64 in both the Transformer base and big models.

\[\begin{align} Attention(Q_i, K_i, V_i) = softmax\left( Mask\left(\frac{Q_iK^T_i}{\sqrt{d_k}}\right)\right)V_i \end{align}\tag{1}\] \[\begin{align} Q_i = X \times W_q + B_q, \quad K_i = X \times W_k + B_k, \quad V_i = X \times W_v + B_v \end{align}\tag{2}\]

\[\begin{align} FFN\_ResBlock(X) = Layer\_Norm\big(X + ReLU(X \times W_1 + b_1) \times W_2 + b_2\big) \end{align}\tag{3}\] \[\begin{align} Layer\_Norm(x) = \gamma \left( \frac{x-\mu}{\sqrt{\sigma^2 +\epsilon}} \right) + \beta \end{align}\tag{4}\]

The FFN ResBlock comprises an LN operation, a residual addition, a ReLU activation, and two linear sublayers, as described in Equation 3, where \(W_1\), \(W_2\) are weights and \(b_1\), \(b_2\) are biases. The operations for layer normalization, softmax, and the GELU and ReLU activation functions are described in Equations 4, 5, 6, and 7, respectively, where x is the input vector (for a particular position in the sequence), \(\mu\) is the mean of x, \(\sigma^2\) is the variance of x, \(\gamma\) and \(\beta\) are learnable parameters, and \(\epsilon\) is a small constant.

\[\begin{align} softmax(x_j) = \frac{e^{x_j}}{\sum_{i} e^{x_i}} \end{align}\tag{5}\] \[\begin{align} GELU(x) = x\,P(X \le x) = x \times \frac{1}{2}\left[1 + \mathrm{erf}\!\left(x/\sqrt{2}\right)\right] \end{align}\tag{6}\] \[RELU(x) = \begin{cases} 0, & x < 0\\ x, & x \ge 0 \end{cases}\tag{7}\]

2.2 High-Level Synthesis Design

High-Level Synthesis (HLS) allows designers to describe circuit functionality at a higher level of abstraction than a hardware description language. HLS tools translate high-level code, typically written in languages like C, C++, or OpenCL, into Register-Transfer Level (RTL) code suitable for FPGA implementation. This approach offers several advantages, including faster development cycles and simplified design modifications, as designers can use familiar programming languages to describe the hardware. Moreover, HLS enables efficient design space exploration, allowing different architectures to be evaluated without extensive hardware design expertise, leading to the rapid creation of accelerators optimized for power, performance, and area [34]. However, HLS does come with challenges, such as ensuring that the generated RTL meets the specified constraints. The success of the synthesized hardware is largely dependent on the robustness of the HLS tools and the expertise of the designer.
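As a purely illustrative example (not ADAPTOR source code), the following HLS C++ kernel shows the style of description such tools accept: a loop annotated with a pipeline pragma is translated into RTL in which the multiply-accumulate is scheduled onto DSP slices.

```cpp
// Illustrative HLS C++ only; names and sizes are arbitrary.
#define N 64

void vec_mac(const int a[N], const int b[N], int &result) {
    int acc = 0;
mac_loop:
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        // A new iteration starts every cycle; the tool maps the
        // multiply-accumulate onto a DSP slice and builds the control logic.
        acc += a[i] * b[i];
    }
    result = acc;
}
```

Design-space exploration then reduces to changing such pragmas and constants rather than rewriting RTL by hand.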

3 ADAPTOR’s Architecture

The core of ADAPTOR is designed in C on the Vitis high-level synthesis (HLS) 2022.2.1 tool. C simulation confirms the algorithm’s correctness, while C/RTL co-simulation validates the functionality of the synthesized hardware. This section describes the HLS design technique that generates an optimized architecture utilizing most of the LUTs and DSPs in the processing modules, ensuring a high degree of computational parallelism. The input vectors and weights are pre-stored in off-chip memory (DRAM/HBM) and transferred to the accelerator via AXI master interfaces. The accelerator receives control signals from the processor through an AXI-Lite slave interface. The overall architecture consists of loading units, computing modules, and activation function units, which are described below. Figs. 9 and 20 show the two main computing modules of ADAPTOR.

3.1 Load Inputs Unit

Inputs are loaded into an input BRAM from external memory. The input BRAM is a dual-port BRAM implemented as a two-dimensional array of size (\(SL \times d_{model}\)) in HLS, where \(SL\) is the sequence length. It is reused to store the outputs of each encoder/decoder layer so that it can feed inputs to the subsequent layer. One Load_inputs unit loads data into the intermediate input buffers of each attention head from external memory using Algorithm 2. These input buffers are declared as two-dimensional arrays of size (\(SL \times TS_{MHA}\)); therefore, tiling is applied along the columns of the matrix, and the buffers are replenished with new data (\(\frac{d_{model}}{TS_{MHA}}\)) times. Another Load_inputs unit loads data into the input buffers of the FFN1 module, which are declared as two-dimensional arrays of size (\(SL \times TS_{FFN}\)), using Algorithm 4. The third Load_inputs unit loads data into the input buffers of the FFN2 module, which are declared as two-dimensional arrays of size (\(SL \times (4\times TS_{FFN})\)), using Algorithm 5. Different units are used because the number of computations, the tile sizes, and the array shapes differ. A simplified sketch of such a loading unit is given below.
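The sketch below illustrates, under simplified assumptions, how such a Load_inputs unit can be written in HLS C++; it is not the exact Algorithm 2. SL, D_MODEL, and TS_MHA are fixed constants here purely for readability, and the data type is left as float, whereas the synthesized design is quantized and selects the active dimensions at runtime.

```cpp
// A minimal sketch of a Load_inputs unit copying one tile of the input
// matrix from off-chip memory into an on-chip per-head buffer.
#define SL      64
#define D_MODEL 768
#define TS_MHA  64

typedef float data_t;   // illustrative; the synthesized design uses fixed point

void load_inputs_mha(const data_t *x_dram,        // AXI master pointer to DRAM/HBM
                     data_t x_buf[SL][TS_MHA],    // per-head input buffer (BRAM)
                     int tile) {                  // tile index, 0 .. D_MODEL/TS_MHA - 1
load_rows:
    for (int r = 0; r < SL; r++) {
    load_cols:
        for (int c = 0; c < TS_MHA; c++) {
#pragma HLS PIPELINE II=1
            // the column offset selects the current tile along d_model
            x_buf[r][c] = x_dram[r * D_MODEL + tile * TS_MHA + c];
        }
    }
}
```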

3.2 Load Weights Unit

There are three Load_weights units. One unit loads weights into the weight buffers of each attention head, on demand, from external memory. Another unit loads weights into the weight buffers of the feedforward network. The third one loads weights for the layer normalization module. The weight buffers of the attention module are declared as two-dimensional arrays of size (\(\frac{d_{model}}{h} \times TS_{MHA}\)) in HLS, where \(TS_{MHA}\) is the tile size for the attention module, i.e., the dimension of the sub-matrices into which the larger weight matrices are divided. They are synthesized as dual-port BRAMs or LUTRAMs and are therefore loaded with partial data from external memory for each tile at each iteration. The weight buffers of the feedforward network are declared as two-dimensional arrays of size (\(TS_{FFN} \times (4\times TS_{FFN})\)). \(TS_{FFN}\) is the tile size for the FFN module, which equals \(\frac{Embedding\_Dimension}{No.\_of\_Tiles\_FFN}\), and (4 \(\times TS_{FFN}\)) equals \(\frac{Hidden\_Dimension}{No.\_of\_Tiles\_FFN}\). Thus, they are tiled along two dimensions (the rows and columns of the matrix) and loaded iteratively for each tile in each dimension. They are also synthesized as dual-port BRAMs or LUTRAMs. The weight buffers for the layer normalization unit are declared as one-dimensional arrays of size \(d_{model}\) in HLS. They are not tiled and are therefore loaded at once, and they are converted into dual-port BRAMs or LUTRAMs after synthesis. Algorithm 3 describes the weight loading process.

3.3 Load Biases Unit

Biases are stored in registers because of their small size. There are three Load_bias units. One unit loads biases for the attention head computations into the registers of each attention head from external memory, as described by Algorithm 6. Algorithm 7 similarly describes the loading of biases for the feedforward networks and the layer normalization unit. Biases are declared as one-dimensional arrays in HLS, and a complete array-partition pragma converts them into registers. Since they are not tiled, they are loaded once with all the data.

Figure 2: Load Inputs for MHA

Figure 3: Load Weights for MHA

Figure 4: Load Inputs for FFN1

Figure 5: Load Inputs for FFN2 & FFN3

Figure 6: Load Biases for MHA

Figure 7: Load Biases for FFN & Layer Norm.

Figure 8: Softmax

3.4 Activation Unit

The activation functions of the transformer are defined here. The ReLU, GELU, and softmax activation functions are commonly used in transformers. They are defined according to their equations and are implemented with LUTs after synthesis. The code for ReLU and GELU is simple, so only the code for softmax is shown in Algorithm 8.
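A simplified HLS C++ sketch of the softmax over one row of attention scores (cf. Equation 5 and Algorithm 8) is shown below; the row length and floating-point type are illustrative, since the synthesized unit works on fixed-point data with LUT-based exponentiation and division.

```cpp
#include <cmath>

#define SL 64   // illustrative row length (sequence length)

// Softmax applied to one row of the score matrix.
void softmax_row(const float s_in[SL], float s_out[SL]) {
    float sum = 0.0f;
exp_loop:
    for (int j = 0; j < SL; j++) {
#pragma HLS PIPELINE II=1
        s_out[j] = std::exp(s_in[j]);   // exponentiate each score
        sum += s_out[j];                // accumulate the denominator
    }
div_loop:
    for (int j = 0; j < SL; j++) {
#pragma HLS PIPELINE II=1
        s_out[j] = s_out[j] / sum;      // normalize so the row sums to 1
    }
}
```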

3.5 Layer Normalization Unit

The mean and variance of the outputs from the attention layer and the feedforward network layer are calculated in the Layer Normalization (LN) unit according to Equation 4, which is implemented by Algorithm 10 in HLS. The values are normalized using the mean and variance, and the results are multiplied by weights and then added with biases element-wise. The outputs from FFN1 and FFN3 pass through the LN unit before further processing.
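The following is a compact sketch of the LN computation for one position vector, following Equation 4; the dimensions and float type are illustrative, and the residual addition that precedes normalization is omitted for brevity.

```cpp
#include <cmath>

#define D_MODEL 768   // illustrative embedding dimension

void layer_norm_row(const float x[D_MODEL], const float gamma[D_MODEL],
                    const float beta[D_MODEL], float y[D_MODEL]) {
    const float eps = 1e-5f;            // small constant epsilon
    float mean = 0.0f, var = 0.0f;

mean_loop:
    for (int i = 0; i < D_MODEL; i++) {
#pragma HLS PIPELINE II=1
        mean += x[i];
    }
    mean /= D_MODEL;

var_loop:
    for (int i = 0; i < D_MODEL; i++) {
#pragma HLS PIPELINE II=1
        float d = x[i] - mean;
        var += d * d;
    }
    var /= D_MODEL;

norm_loop:
    for (int i = 0; i < D_MODEL; i++) {
#pragma HLS PIPELINE II=1
        // normalize, then apply the learnable scale and shift
        y[i] = gamma[i] * (x[i] - mean) / std::sqrt(var + eps) + beta[i];
    }
}
```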

3.6 Attention Module

The architecture to accelerate the attention module is shown in Fig. 9. There are three main processing modules (PM) in it, denoted \({QKV}_{PM}\), \({QK}_{PM}\), and \({SV}_{PM}\) according to the output they produce. The number of instances of these modules depends on the number of attention heads (h). Each module contains an array of processing elements (PE). A PE comprises a DSP48 slice performing multiplication and accumulation (MAC) operations. The number of PEs (\(t\)) depends on the unrolling factor of the inner loop and the initiation interval of the pipelined outer loop. The PE array’s data access pattern and computational requirements differ across modules. Therefore, the modules are defined separately as functions in HLS to produce distinct sets of PE arrays, which allows each RTL module to be optimized separately. Input data and weights are stored in multiple BRAMs/LUTRAMs to enable parallel access.

In our architecture, each PE is independent, with its own local memory, control, and computing unit. The weights (\(W_q\), \(W_k\), \(W_v\)) for generating the query (Q), key (K), and value (V) matrices are declared as separate two-dimensional arrays of size (\(\frac{d_{model}}{h} \times TS_{MHA}\)) in HLS. \(TS_{MHA}\) is the tile size in the attention module; it represents the dimension of the sub-matrices into which the larger weight matrices are divided. The number of heads and tiles, together with the array partition directive in HLS, determines how the arrays are partitioned to generate multiple two-port BRAMs. Due to the limited ports of BRAMs, array partitioning and data loading are managed so that data required simultaneously by a DSP are stored in separate BRAMs. The \(Q\), \(K\), and \(V\) matrices of size (\(SL\times \frac{d_{model}}{h}\)) are stored in intermediate buffers, where \(SL\) is the sequence length.

3.6.1 \(QKV_{PM}\) module

The \({QKV}_{PM}\) module generates the query, key, and value matrices. This module contains the \(W_q\), \(W_k\), \(W_v\) buffers and the input (\(X_i\)) buffers, from which data is accessed in parallel by parallel DSP units. The arrays used in this module are divided into subarrays using our tiling technique so that they fit into the BRAMs or LUTRAMs. The number of loop iterations of the \({QKV}_{PM}\) module depends on the tile size; there are (\(\frac{d_{model}}{TS_{MHA}}\)) tiles, or iterations, in total. At each iteration, the \(W_q\), \(W_k\), \(W_v\), and \(X_i\) buffers are loaded with distinct data, and then the computations start in the PEs. Simultaneously, the biases for the \(Q\), \(K\), and \(V\) matrices are loaded into registers from off-chip memory while the \({QKV}_{PM}\) module performs computations; they are then added to the \(Q\), \(K\), \(V\) matrices. Algorithm 11 outlines the computations of this module, where the 2nd loop (line 6) is pipelined, causing the innermost loop (line 8) to be fully unrolled. This generates (\(\frac{d_{model}}{TS_{MHA}}\)) PEs.
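A structural sketch of this computation for one head and one weight tile is given below. It is meant to show the loop structure (pipelined middle loop, fully unrolled innermost loop, accumulation across tiles) rather than ADAPTOR's exact code; the dimensions and data type are illustrative.

```cpp
#define SL      64     // sequence length (illustrative)
#define DK      96     // d_model / h (illustrative)
#define TS_MHA  64     // tile size along d_model

typedef float data_t;  // fixed point in the synthesized design

void qkv_pm_tile(const data_t x_buf[SL][TS_MHA],   // input tile
                 const data_t wq_buf[DK][TS_MHA],  // weight tile for Q
                 data_t q_buf[SL][DK],             // partial Q, accumulated over tiles
                 bool first_tile) {
rows:
    for (int s = 0; s < SL; s++) {
    cols:
        for (int j = 0; j < DK; j++) {
#pragma HLS PIPELINE II=1
            // start from zero on the first tile, otherwise accumulate
            data_t acc = first_tile ? (data_t)0 : q_buf[s][j];
        mac:
            for (int c = 0; c < TS_MHA; c++) {
#pragma HLS UNROLL
                // the fully unrolled MACs map onto parallel DSP48-based PEs
                acc += x_buf[s][c] * wq_buf[j][c];
            }
            q_buf[s][j] = acc;
        }
    }
}
```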

3.6.2 \(QK_{PM}\) module

\({QK}_{PM}\) module performs the matrix-matrix multiplication operations between the \(Q\) and \(K\) matrices. Because these matrices are relatively small, they are not tiled. Algorithm 13 describes these operations. The innermost loop (line 6) is fully unrolled, generating (\(\frac{d_{model}}{h}\)) PEs for this module. This module contains the \(Q\) and \(K\) buffers from which data is retrieved by the DSP units. As the division operation described in Equation 1 is executed within this module using LUTs, the number of parallel operations is constrained to prevent excessive use of LUTs. A matrix (\(S\)) of attention weights is generated within this module, which is stored in BRAM or registers. Subsequently, these values are forwarded to the non-linear softmax activation function. This function, as described in HLS, performs its computations using LUTs and FFs.

3.6.3 \(SV_{PM}\) module

The output matrix (\(S\)) derived from the softmax operation is transmitted to the \({SV}_{PM}\) module (Algorithm 14), where it undergoes matrix-matrix multiplication with the value (V) matrix. Algorithm 14 fully unrolls the innermost loop (line 6), resulting in \(SL\) PEs. The output from this module is referred to as the attention score.

3.7 Feedforward Network Module

There are three RTL modules for the FFN, performing the operations of the feedforward networks of different architectures. The functions representing these modules are defined with different array dimensions for the inputs and outputs in HLS; these arrays are converted into BRAMs/LUTRAMs after synthesis. The number of computations inside each module differs, so each FFN has a separate function, and the modules contain different numbers of processing elements after synthesis because of the different unrolling factors of the innermost loops. The weights are stored in a two-dimensional array (\(W_o\)) with dimensions (\(\frac{d_{model}}{TS_{FFN}} \times \frac{4 \times d_{model}}{TS_{FFN}}\)) in HLS, where \(TS_{FFN}\) represents the tile size in the FFN. Both \(FFN1_{PM}\) and \(FFN3_{PM}\) are followed by layer normalization (LN) modules.

Figure 9: Attention Module of ADAPTOR.

3.7.1 \(FFN1_{PM}\) module

The \(FFN1_{PM}\) module performs the first linear transformation on the attention scores. The arrays used by the PEs are tiled along both dimensions. Thus, this module is accessed \(TS_{FFN} \times TS_{FFN}\) times to complete the operation. Algorithm 15 outlines the computations of this module, where the 2nd loop is pipelined, causing the innermost loop (line 7) to be fully unrolled. This generates \(TS_{FFN}\) PEs, which equals \(\frac{d_{model}}{Tile\;no.\;FFN}\).

3.7.2 \(FFN2_{PM}\) module

The \(FFN2_{PM}\) module performs the second linear transformation on the normalized outputs of the \(FFN1_{PM}\) module. The arrays used by the PEs are tiled along both dimensions. Thus, this module is accessed \(4\times TS_{FFN} \times TS_{FFN}\) times to complete the operation. Algorithm 16 outlines the computations of this module, where the 2nd loop is pipelined, causing the innermost loop (line 7) to be fully unrolled. This generates \(TS_{FFN}\) PEs, which equals \(\frac{d_{model}}{Tile\;no.\;FFN}\).

3.7.3 \(FFN3_{PM}\) module

The \(FFN3_{PM}\) module performs the final linear transformation on the normalized outputs of the \(FFN2_{PM}\) module. The arrays used by the PEs are tiled along both dimensions. Thus, this module is accessed \(4\times TS_{FFN} \times TS_{FFN}\) times to complete the operation. Algorithm 12 outlines the computations of this module, where the 2nd loop is pipelined, causing the innermost loop (line 7) to be fully unrolled. This generates \(4\times TS_{FFN}\) PEs, which equals \(\frac{4\times d_{model}}{Tile\;no.\;FFN}\).

Figure 10: Layer Normalization

3.8 Bias Add Unit

There are three Bias_add units for adding biases to the query (\(Q\)), key (\(K\)), and value (\(V\)) matrices and to the outputs of the three feedforward networks. Three units are used because the HLS function corresponding to each unit takes arrays of different dimensions as inputs and generates arrays of different dimensions as outputs. One of the units used after the feedforward networks contains the ReLU activation function. Algorithms 17, 18, and 19 describe the operations of the three units.
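A brief sketch of the bias-add variant that fuses the ReLU activation (Equation 7) is shown below; the array sizes and float type are placeholders.

```cpp
#define SL   64    // illustrative sequence length
#define COLS 128   // illustrative output width

void bias_add_relu(float out[SL][COLS], const float bias[COLS]) {
rows:
    for (int s = 0; s < SL; s++) {
    cols:
        for (int j = 0; j < COLS; j++) {
#pragma HLS PIPELINE II=1
            float v = out[s][j] + bias[j];      // element-wise bias addition
            out[s][j] = (v < 0.0f) ? 0.0f : v;  // ReLU
        }
    }
}
```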

Figure 11: Q, K, V Calculation

Figure 12: FFN3 Calculation

Figure 13: \(Q\times K^T\) Calculation

Figure 14: \(S\times V\) Calculation

Figure 15: FFN1 Calculation

Figure 16: FFN2 Calculation

Figure 17: Bias add unit 1

Figure 18: Bias add unit 2

Figure 19: Bias add unit 3

3.9 Tiling Technique

As transformer models tend to be large, tiling helps prevent excessive utilization of on-chip memory and computing units. It also ensures that the HLS tool can effectively partition arrays and pipeline or unroll the loops to minimize latency within a short compilation time. Fig. 21 (a) describes our tiling strategy for the MHA. The weight matrices are partitioned into tiles, allowing the BRAMs to be loaded with partial data retrieved from off-chip memory. They are tiled along the second dimension (the columns of the matrix) only, because the first dimension (the rows) is already reduced by the number of heads; thus, they are loaded (\(\frac{d_{model}}{{TS}_{MHA}}\)) times. The input buffers of each attention head are declared as two-dimensional arrays of size (\(SL \times TS_{MHA}\)); tiling is therefore applied along the columns, and the buffers are likewise loaded (\(\frac{d_{model}}{{TS}_{MHA}}\)) times. At each iteration, data for only one tile is loaded. The PEs then perform computations on this data, storing the results in intermediate buffers, and these results are accumulated with those from previous iterations. Consequently, the final output is the cumulative sum of the outputs computed over all tiles. A simplified sketch of this schedule is shown below.
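The control flow of this schedule can be summarized by the following sketch. The function names are placeholders that stand for the loading and processing units described in Section 3; the interfaces are simplified and not ADAPTOR's exact ones.

```cpp
#define SL      64
#define D_MODEL 768
#define TS_MHA  64
#define DK      96                     // d_model / h (illustrative)
#define N_TILES (D_MODEL / TS_MHA)

typedef float data_t;

// placeholder prototypes standing in for the units sketched earlier
void load_inputs_mha(const data_t *x_dram, data_t x_buf[SL][TS_MHA], int tile);
void load_weights_mha(const data_t *w_dram, data_t w_buf[DK][TS_MHA], int tile);
void qkv_pm_tile(const data_t x_buf[SL][TS_MHA], const data_t w_buf[DK][TS_MHA],
                 data_t q_buf[SL][DK], bool first_tile);

// One linear projection of the MHA, computed tile by tile as in Fig. 21(a).
void mha_linear_tiled(const data_t *x_dram, const data_t *w_dram,
                      data_t q_buf[SL][DK]) {
    data_t x_buf[SL][TS_MHA];
    data_t w_buf[DK][TS_MHA];
tiles:
    for (int t = 0; t < N_TILES; t++) {
        load_inputs_mha(x_dram, x_buf, t);        // refill buffers with tile t
        load_weights_mha(w_dram, w_buf, t);
        qkv_pm_tile(x_buf, w_buf, q_buf, t == 0); // accumulate partial products
    }
}
```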

Figure 20: Feedforward Network Module of ADAPTOR.

Figure 21: Tiling Technique. (a) Tiling technique in MHA. (b) Tiling technique in FFN.

The FFNs that follow the attention layer are the most time-consuming and resource-consuming layers. The weight matrices of the feedforward network are declared as two-dimensional arrays of size \((TS_{FFN})\times (4\times TS_{FFN})\). Thus, they are tiled on both dimensions (row and column of the matrix), and two loops are used to load them iteratively for each tile on each axis. The first FFN module is reused \((\frac{d_{model}}{TS_{FFN}})^2\) times because both loops iterate \(\frac{d_{model}}{TS_{FFN}}\) times. The second and third FFN modules are reused \((\frac{4\times (d_{model})^2}{(TS_{FFN})^2})\) times. Fig. 21 (b) describes our unique tiling strategy on FFN. Results are first accumulated along the columns, followed by accumulation along the rows for all tiles.

3.10 Tile Size Determination

In ADAPTOR, the programmable parameters can be adjusted at runtime, whereas the tile size must be set before synthesis, as it cannot be modified without re-synthesizing the entire hardware. The graphs in Fig. 22 (a) and (b) illustrate how variations in \(TS_{MHA}\) and \(TS_{FFN}\) impact system frequency (MHz) and latency (normalized to the minimum value), respectively. The number of tiles in the MHA (\(\frac{d_{model}}{TS_{MHA}}\)) was varied from 6 to 48 for each FFN tile count (\(\frac{d_{model}}{TS_{FFN}}\)), which ranged from 2 to 6. The plots indicate that the optimal configuration for achieving the highest frequency and the lowest latency was 12 tiles in the MHA and 6 tiles in the FFN; this combination achieves a maximum frequency of 200 MHz. Moreover, experiments showed that a \(TS_{MHA}\) of 64 and a \(TS_{FFN}\) of 128 are optimal for HLS, allowing efficient array partitioning within a reasonable compilation time (approximately 36 hours) for a state-of-the-art (SOTA) transformer.
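As a concrete reading of these numbers, with the BERT-base embedding dimension \(d_{model} = 768\) used above, the chosen tile counts correspond to the stated tile sizes: \[TS_{MHA} = \frac{d_{model}}{Tile\;no.\;MHA} = \frac{768}{12} = 64, \qquad TS_{FFN} = \frac{d_{model}}{Tile\;no.\;FFN} = \frac{768}{6} = 128.\]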

Figure 22: Choosing the optimum tile size.

3.11 Software Control

Parameters such as the number of attention heads, embedding dimension, hidden dimension, sequence length, and the number of encoders and decoders are programmable at runtime in our design. These parameters can be sent to the accelerator from the software using the steps shown in Fig. 23. TNN models are trained using the PyTorch framework, and the resulting models are saved as ‘.pth’ files. We used pre-trained models available on Hugging Face [35], run on a Tesla V100 GPU. These files are then passed to a Python interpreter that extracts the values of the parameters. These values differ between applications, but our accelerator does not go through the synthesis steps for each one; only some variables in the software code need to be assigned new values. The Xilinx SDK tool was used to write the software in C++, which runs on the processor. Algorithm 24 briefly describes the software part. Based on the values extracted by the interpreter, the processor generates instructions and control signals for the accelerator, allowing it to activate different parts of the hardware.

3.12 Configuration Registers

ADAPTOR contains a set of registers accessed by the MicroBlaze CPU through the AXI4-Lite interface. They are used to specify the topology of the TNN at runtime. The registers are listed below, along with the corresponding parameters they store; a minimal host-side sketch of programming them follows the list.


  • Sequence: sequence length of inputs.

  • Heads: number of attention heads.

  • Layers_enc: number of encoders.

  • Layers_dec: number of decoders.

 

  • Embeddings: dimension of the embedding layer.

  • Hidden: dimension of the intermediate layers.

  • Out: number of outputs.
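A minimal host-side sketch of writing these registers from the MicroBlaze over AXI4-Lite is shown below. The base address and register offsets are hypothetical placeholders (in practice they come from the address map and driver exported by HLS); only the standard Xilinx Xil_Out32() register-write call is assumed.

```cpp
#include "xil_io.h"            // Xilinx bare-metal register access (Xil_Out32)

// Hypothetical base address and offsets, used only for illustration.
#define ACC_BASE_ADDR   0x44A00000u
#define REG_SEQUENCE    0x10u
#define REG_HEADS       0x18u
#define REG_LAYERS_ENC  0x20u
#define REG_LAYERS_DEC  0x28u
#define REG_EMBEDDINGS  0x30u
#define REG_HIDDEN      0x38u
#define REG_OUT         0x40u

// Program the TNN topology at runtime; no re-synthesis is involved.
void configure_adaptor(unsigned seq_len, unsigned heads, unsigned enc_layers,
                       unsigned dec_layers, unsigned d_model, unsigned d_hidden,
                       unsigned n_out) {
    Xil_Out32(ACC_BASE_ADDR + REG_SEQUENCE,   seq_len);
    Xil_Out32(ACC_BASE_ADDR + REG_HEADS,      heads);
    Xil_Out32(ACC_BASE_ADDR + REG_LAYERS_ENC, enc_layers);
    Xil_Out32(ACC_BASE_ADDR + REG_LAYERS_DEC, dec_layers);
    Xil_Out32(ACC_BASE_ADDR + REG_EMBEDDINGS, d_model);
    Xil_Out32(ACC_BASE_ADDR + REG_HIDDEN,     d_hidden);
    Xil_Out32(ACC_BASE_ADDR + REG_OUT,        n_out);
}

// Example: switch the accelerator to a BERT-base-like topology at runtime.
// configure_adaptor(64, 12, 12, 0, 768, 3072, 768);
```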

4 Overall System

Fig. 25 shows the complete system design for running ADAPTOR on the different FPGA platforms used in our experiments: the VC707 (Virtex-7 xc7vx485tffg1761-2), the ZCU102 (Zynq UltraScale+ xczu9eg-ffvb1156-2-e MPSoC), and the U55C (UltraScale+ xcu55c-fsvh2892-2L-e). Both the VC707 and ZCU102 boards have onboard DDR3 DRAM, while the Alveo U55C contains high-bandwidth memory (HBM). Each design parameter can be programmed at runtime, up to a maximum value, by the MicroBlaze CPU. The overall system was designed in the Vivado 2022.1.2 design suite and contains a custom IP block for the TNN accelerator, which was exported from HLS. These Xilinx-AMD tools were run on an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10 GHz with 32 cores and 192 GB RAM (the host PC). The inputs and weights are fetched from the off-chip HBM or DRAM through AXI4 master interfaces [36] when the load instruction from the accelerator controller is received, according to demand. The accelerator receives control signals from the processor through an AXI-Lite slave interface. The CPU can access the HBMs/DRAMs connected to the accelerator; it is used to transfer data to the BRAMs from the HBMs and to send control signals to the accelerator. The boards are connected to the host PC with a USB-JTAG interface and a PCIe 3.0 x4 interface. This host can communicate with the other IPs, except the CPU, using the DMA/Bridge Subsystem for PCI Express IP [37], but PCIe communication was not needed in this work. The CPU uses an AXI Timer [38] to measure the latency, which covers the time between the start and stop signals from the custom IP module. The host, connected through the JTAG cable [39], displays the results on the terminal using the UARTLite interface [40].

Figure 23: Programming procedures with software.

Figure 24: Software Program

Figure 25: Complete System Design

5 Theoretical Model

The parameters that affect resource utilization and performance of ADAPTOR are the tile size or number of tiles in the attention module, tile size or number of tiles in the feedforward network, number of attention heads, sequence length, embedding dimension, hidden dimension, and number of encoders and decoders when bit width is fixed. The number of DSPs depends on the number of parallel multiplications. They are mostly used in \({QKV}_{PM}\), \({QK}_{PM}\), \({SV}_{PM}\), and \(FFN\) modules. The number of BRAMs depends on the number of arrays to store data, the BRAM modes applied to them for synthesis, and how they are partitioned using HLS pragmas. An analytical model was developed to establish the relationship of latency and resource utilization with these parameters. This model helps to estimate latency, resource utilization, and the value of the parameters before synthesis.

5.1 Model for DSP utilization

Equation 8 gives an estimate for DSP consumption. It was derived from all the loops described in the functions that generate RTL modules for \({QKV}_{PM}\), \({QK}_{PM}\), \({SV}_{PM}\), and \(FFN\). \[\label{eq:dsp} \begin{align} No.\;of\;DSPs &= 3\times h \times \frac{d_{model}}{Tile\;no.\;MHA} + h\;\times \left(\frac{d_{model}}{h}+SL\right) + 6\times \frac{d_{model}}{Tile\;no.\;FFN}\;+ d_{model} \end{align}\tag{8}\]

The design is modular, and each module is implemented as a function with loops. Thus, the latency of the modules depends on the loop iteration latency, which in turn depends on the loop pipeline and unrolling pragmas. For the nested loops in the modules, the second loop from the last is pipelined, resulting in a complete unroll of the innermost loop. The outermost loop has no pragmas, to avoid complicated pipeline depths and resource requirements. The pipelined loop latency (PLL) can be calculated by Eq. 9. If the loop is enclosed by another loop, then the total latency (TL) is given by Eq. 10. Here, the trip count (TC) is the number of iterations of a loop, and the initiation interval (II) is the latency between the initiation of two consecutive iterations. The pipeline depth (PD) is the latency to finish one iteration; it depends on the sequential and parallel operations executed in one iteration, so different modules of the accelerator can have different pipeline depths. Latency is measured in clock cycles (cc). \[Pipelined\_Loop\_Latency = Pipeline\_Depth + Initiation\_Interval\times(Trip\_Count - 1)\tag{9}\] \[Total\_Latency = Pipelined\_Loop\_Latency\times Outer\_Loop\_Trip\_Count\tag{10}\]

Equations 9 and 10 are generalized expressions for latency; their variables differ for the different modules of ADAPTOR, as shown in the following equations.
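As an illustrative application of Equations 9 and 10 (with hypothetical numbers, not ADAPTOR measurements), a pipelined loop with a trip count of 64, an initiation interval of 1, and a pipeline depth of 10 cc, enclosed by an outer loop with a trip count of 12, gives \[PLL = 10 + 1\times(64-1) = 73\;cc, \qquad TL = 73\times 12 = 876\;cc.\]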

5.2 Latency model for Attention Module

\[LI = [(d_{model}-1)\times 1 + PD\_L]\times SL\tag{11}\] \[LBA = \left(\frac{d_{model}}{h}-1\right)\times 1 + PD\_L\tag{12}\] \[LWA = \left[\left(\frac{d_{model}}{h}-1\right) \times 1 + PD\_L\right] \times SL\tag{13}\] \[LIA = \left[\left(\frac{d_{model}}{Tile\;no.\;MHA}-1\right) \times 1 + PD\_L\right] \times SL\tag{14}\]

where, PD_L is \(Pipeline\_Depth\_Load\) that includes the time required to establish communication with HBM using AXI master interface (7 cc), read address location (1 cc), load (1 cc), and store (1 cc) data from and to that address, and convert floating point data to fixed point (3 cc) for tasks such as loading all inputs (LI), as well as loading inputs (LIA), biases (LBA) and weights (LWA) for each attention head.

\[SA = \left[\left(\frac{d_{model}}{h}-1\right)\times 1 + PD\_MHA\right]\times SL\tag{15}\] \[BA = \left[\left(\frac{d_{model}}{h}-1\right)\times 1 + PD\_BA\right] \times SL\tag{16}\] \[Score\,(S) = [(SL-1)\times 1 + PD\_S]\times SL\tag{17}\] \[SV = \left[\left(\frac{d_{model}}{h}-1\right)\times 1 + PD\_SV\right]\times SL\tag{18}\] \[\begin{align} SM =\;& [(SL-1)\times 1 + Load + Store]\times SL \\ &+ [(SL-1)\times 1 + Load + Store + add + exponentiation]\times SL \\ &+ [(SL-1)\times 2 + Load + Store + divide]\times SL \end{align}\tag{19}\]

\(Pipeline\_Depth\_MHA\) (PD_MHA) equals (\(\frac{d_{model}}{TS_{MHA}}\)) plus the time required to load, multiply (2 cc), add (1 cc), and store for computing self-attention (SA) in \(QKV_{PM}\) module. \(Pipeline\_Depth\_Bias\_Add\) (PD_BA) includes latency associated with loading, adding, and storing operations in bias addition (BA) tasks. \(Pipeline\_Depth\_Score\) (PD_S) equals (\(\frac{d_{model}}{h}\)), the time required to compute the score (S) in \(QK_{PM}\) module. \(Pipeline\_Depth\_SV\) (PD_SV) equals \(Sequence\_Length\) in the computation of SV within the \(SV_{PM}\) module. Equation 19 estimates time for softmax (SM) calculation, which includes exponentiation (4 cc) and division (14 cc). It starts after the \(QK_{PM}\) module is finished.

5.3 Latency model for FFN1 Module

\[LIF1=\left[\left(\frac{d_{model}}{Tile\;no.\;FFN}-1\right)\times 1 + PD\_LFFN1\right]\times SL\tag{20}\]

\[LWF1=\left[\left(\frac{d_{model}}{Tile\;no.\;FFN}-1\right)\times 1 + PD\_L\right]\times \frac{d_{model}}{Tile\;no.\;FFN}\tag{21}\] \[LBF1=(d_{model}-1)\times 1 + PD\_L\tag{22}\] \[FFN1=\left[\left(\frac{d_{model}}{Tile\;no.\;FFN}-1\right)\times 1 + PD\_FFN1\right]\times SL\tag{23}\] \[BAF1=[(d_{model}-1)\times 1 + PD\_BA]\times SL\tag{24}\]

where, \(Load\_Inputs\_FFN1\) (LIF1) unit loads tiled outputs from the attention module into the input buffer of the FFN1 module. \(Load\_Weights\_FFN1\) (LWF1) unit loads partial weights from off-chip memory to the weight buffer of the FFN1 module according to \(TS_{FFN}\). \(Pipeline\_Depth\_FFN1\) (PD_FFN1) equals (\(\frac{d_{model}}{Tile\;no.\;FFN}\)) plus the time required to perform load, add, and store operations in the FFN1 module. \(Pipeline\_Depth\_Load\_FFN1\) (PD_LFFN1) is the time required to load, add, and store in the loading units. FFN1 is the computation time of \(FFN1_{PM}\) module. \(Load\_Biases\_FFN\) (LBF1) loads biases to registers from off-chip memory while \(FFN1_{PM}\) operates. \(Bias\_Addition\_FFN1\) (BAF1) adds biases to the outputs of \(FFN1_{PM}\).

5.4 Model for BRAM utilization

Equation 25 gives an estimate for BRAM consumption. It was derived from all the arrays declared in HLS with true dual-port BRAM pragmas. \[\label{eq:bram} \begin{align} No.\;of\;BRAMs =& \frac{10\times SL\times d_{model} \times Bit\_w}{BRAM\_w\times BRAM\_d} + SL\;\times max\left(0.5, \frac{SL\times Bit\_w}{BRAM\_w\times BRAM\_d}\right) + max\left(0.5, \frac{SL\times d_{model}\times Bit\_w}{BRAM\_w\times BRAM\_d}\right) + \\ & \frac{h\times SL\times d_{model}\times Bit\_w}{BRAM\_w\times BRAM\_d} + max\left(0.5, \frac{d_{model}\times Bit\_w}{BRAM\_w\times BRAM\_d}\right) + \frac{SL\times Tile\;no.\;MHA\times Bit\_w}{BRAM\_w\times BRAM\_d} + \\ & Tile\;no.\;MHA \times h\times max\left(0.5, \frac{SL\times Bit\_w}{BRAM\_w\times BRAM\_d}\right) + \frac{8\times d_{model}^2 \times Bit\_w}{Tile\;no.\;FFN\times BRAM\_w\times BRAM\_d} + \\ & Tile\;no.\;MHA\times h \times max\left(0.5, \frac{d_{model}\times Bit\_w}{BRAM\_w\times BRAM\_d}\right) + \frac{d_{model}}{Tile\;no.\;FFN}\times max(0.5, \\ &\left.\frac{SL\times Bit\_w}{BRAM\_w\times BRAM\_d}\right) + 4\times d_{model}\times max\left(0.5, \frac{SL\times Bit\_w}{BRAM\_w\times BRAM\_d}\right) \end{align}\tag{25}\] Here, BRAM_d is the depth of BRAMs which indicates the number of storage locations (or entries) within a BRAM block. Each location holds a fixed number of bits, defined by the width of BRAM (BRAM_w), and both parameters can vary depending on the platform. Bit_w is the bit precision of the data being stored. BRAM_w = 36 and BRAM_d = 1024 for most FPGAs.

5.5 Latency model for LN Module

\[LWN = (d_{model}-1)\times 1 + PD\_L\tag{26}\] \[LBN = (d_{model}-1)\times 1 + PD\_L\tag{27}\] \[RC = [(d_{model}-1)\times 1 + PD\_BA]\times SL\tag{28}\]

\[\begin{align} Layer\;Norm =\;& [(d_{model}-1)\times 2 + Load + Add + Store]\times SL \\ &+ [(d_{model}-1)\times 2 + Load + multiply + add + store]\times SL \\ &+ [(d_{model}-1)\times 1 + Load + Square + multiply + add + Store + divide + float\_to\_fixed\_conversion]\times SL \\ &+ [(d_{model}-1)\times 1 + Load + add + Store]\times SL \end{align}\tag{29}\]

where, \(Load\_Weights\_LN\) (LWN) unit loads weights from off-chip memory to the weight buffer of the LN module. \(Load\_Biases\_LN\) (LBN) loads biases to registers from off-chip memory. RC represents the operations of the residual connection in LN module. \(float\_to\_fixed\_conversion\) in the LN module takes 3 cc.

5.6 Latency model for FFN2 Module

\[LIF2=\left[\left(\frac{d_{model}}{Tile\;no.\;FFN}-1\right)\times 1 + PD\_LFFN2\right]\times SL\tag{30}\] \[LWF2=\left[\left(\frac{d_{model}}{Tile\;no.\;FFN}-1\right)\times 1 + PD\_L\right]\times \frac{d_{model}}{Tile\;no.\;FFN}\tag{31}\] \[LBF2=(d_{model}-1)\times 1+PD\_L\tag{32}\] \[FFN2=\left[\left(\frac{4\times d_{model}}{Tile\;no.\;FFN}-1\right)\times 1+PD\_FFN2\right]\times SL\tag{33}\] \[BAF2=[(4\times d_{model}-1)\times 1+PD\_BA]\times SL\tag{34}\]

where, \(Load\_Inputs\_FFN2\) (LIF2) unit loads tiled outputs from the FFN1 module into the input buffer of the FFN2 module. \(Load\_Weights\_FFN2\) (LWF2) unit loads partial weights from off-chip memory to the weight buffer of the FFN2 module according to \(TS_{FFN}\). \(Pipeline\_Depth\_FFN2\) equals (\(\frac{d_{model}}{Tile\;no.\;FFN}\)) plus the time required to perform load, add, and store operations in the FFN2 module. \(Pipeline\_Depth\_Load\_FFN2\) (PD_LFFN2) is the time required to load, add, and store in the loading units. FFN2 is the computation time of \(FFN2_{PM}\) module. \(Load\_Biases\_FFN2\) (LBF2) loads biases to registers from off-chip memory while \(FFN2_{PM}\) operates. \(Bias\_Addition\_FFN2\) (BAF2) adds biases to the outputs of \(FFN2_{PM}\).

5.7 Latency model for FFN3 Module

\[LIF3=\left[\left(\frac{4\times d_{model}}{Tile\;no.\;FFN}-1\right)\times 1 + PD\_LFFN3\right]\times SL\tag{35}\] \[LWF3=\left[\left(\frac{4\times d_{model}}{Tile\;no.\;FFN}-1\right)\times 1 + PD\_L\right]\times \frac{d_{model}}{Tile\;no.\;FFN}\tag{36}\] \[LBF3=(d_{model}-1)\times 1 + PD\_L\tag{37}\] \[FFN3=\left[\left(\frac{d_{model}}{Tile\;no.\;FFN}-1\right)\times 1 + PD\_FFN3\right]\times SL\tag{38}\] \[BAF3=[(d_{model}-1)\times 1 + PD\_BA]\times SL\tag{39}\]

where, \(Load\_Inputs\_FFN3\) (LIF3) unit loads tiled outputs from the FFN2 module into the input buffer of the FFN3 module. \(Load\_Weights\_FFN3\) (LWF3) unit loads partial weights from off-chip memory to the weight buffer of the FFN3 module according to \(TS_{FFN}\). \(Pipeline\_Depth\_FFN3\) equals (\(\frac{4\times d_{model}}{Tile\;no.\;FFN}\)) plus the time required to perform load, add, and store operations in the FFN3 module. \(Pipeline\_Depth\_Load\_FFN3\) is the time required to load, add, and store in the loading units. FFN3 is the computation time of \(FFN3_{PM}\) module. \(Load\_Biases\_FFN3\) (LBF3) loads biases to registers from off-chip memory while \(FFN3_{PM}\) operates. \(Bias\_Addition\_FFN3\) (BAF3) adds biases to the outputs of \(FFN3_{PM}\).

6 Evaluation and Results

ADAPTOR is software-programmable, so design parameters such as the embedding dimension (\(d_{model}\)), the number of heads (h), the number of encoder layers (N), and the sequence length (SL) can be modified at runtime from software. They were initially set to fixed values of 768, 12, 12, and 64, respectively, based on a variant of BERT [10], a transformer model for NLP, and on the available FPGA resources. The tile sizes cannot be altered after synthesis; thus, synthesis was performed with fixed tile sizes of \(TS_{MHA}\) = 64 and \(TS_{FFN}\) = 128. This approach allows ADAPTOR to be synthesized once for a fixed set of resources while retaining the flexibility to support various TNN models as needed.

Figure 26: Performance and resource utilization vs. number of attention heads. (a) Variation of performance with the number of attention heads. (b) Variation of resources with the number of attention heads.

Fig. 26 (a) illustrates the impact of the number of attention heads on system frequency and normalized latency. The latency includes only computation time, since data loading is overlapped with computation. While increasing the number of attention heads is expected to enhance parallelism and reduce latency, the system frequency decreases beyond a certain point, leading to an increase in latency. Optimal performance is observed with 6 to 10 attention heads. Fig. 26 (b) shows the increase in resource usage for both DSPs and LUTs as the number of attention heads increases. This high resource utilization is responsible for the reduced frequency.

Figure 27: Utilization vs tile size.

Fig. 27 illustrates how DSP, LUT, and BRAM utilization is influenced by different combinations of tile sizes (\(TS_{MHA}\), \(TS_{FFN}\)). Since the processing modules rely on DSPs for MAC operations, DSPs are the most heavily used and can reach full utilization before BRAMs, making the accelerator computation-bound. As the tile sizes for both the attention and feedforward modules increase, more DSPs are utilized for parallel operations, reducing latency until a drop in frequency occurs.

Fig. 28 compares the power consumption (in watts) and power efficiency (throughput per watt, GOPS/W) of various models across different CPUs, GPUs, and our FPGA accelerator. Data for the different models and platforms were obtained from the cited literature, and we used them to compare ADAPTOR's performance on the U55C platform for the same models. Since ADAPTOR is synthesized only once, and power is measured using Vivado's power estimation tool post-synthesis, the total dynamic power consumption remains constant across all models. The JETSON TX2 GPU [18] achieves the highest power efficiency for the BERT model, mainly due to the sparse architecture of the algorithm, and also has the lowest overall power consumption. The RTX K5000 GPU [41] is 1.5\(\times\) more power efficient than ADAPTOR for the BERT model, due to compression techniques, but consumes 10\(\times\) more power. The i7-8700K CPU is the least power-efficient for BERT [41]. ADAPTOR is 1.2\(\times\) and 2.87\(\times\) more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU, respectively, when running BERT, according to FQ-BERT [42]. A custom encoder with four encoding layers was run on an i5-4460 CPU and an RTX 3060 GPU [30], which were 5.1\(\times\) and 1.63\(\times\) less power efficient than ADAPTOR, respectively, while also being more power-hungry. Fang et al. [43] executed a shallow transformer on an i9-9900X CPU, a JETSON NANO GPU, and RTX 2080 and RTX 3090 GPUs. Although the JETSON NANO GPU consumed 1.56\(\times\) less power than ADAPTOR, the other devices used 14–30\(\times\) more power, and ADAPTOR is 3.7\(\times\), 1.28\(\times\), 4.4\(\times\), and 1.67\(\times\) more power efficient than these four platforms, respectively.

Figure 28: Cross Platform Comparison of Power Consumption.

Fig. 29 illustrates that ADAPTOR can be deployed on any platform, regardless of the size of the TNN model or available resources, by adjusting the \(TS_{MHA}\) and \(TS_{FFN}\) parameters in HLS during design time. The figure presents results for a custom TNN encoder with an embedding dimension of 200, 3 attention heads, 2 encoder layers, and a sequence length of 64. On the Alveo U55C, the tile sizes can be maximized (\(TS_{MHA}\) = 200, \(TS_{FFN}\) = 200) due to the abundance of resources, resulting in lower latency. For the ZCU102 board, the tile sizes were reduced to 25 and 50 respectively, to fit the model within its resource constraints, nearly consuming 100% of the DSPs and LUTs and increasing the latency. On the VC707 board, \(TS_{MHA}\) and \(TS_{FFN}\) were set to 50 each, as it has slightly more resources than the ZCU102. However, latency increased as fewer DSPs were utilized, and LUT consumption almost reached its limit.

Figure 29: Testing Portability Feature.

Fig. 30 illustrates the roofline model of ADAPTOR, showing the peak performance and peak memory bandwidth any application can achieve with it. The memory-bound line (blue, dashed) indicates the highest performance attainable with the available memory bandwidth, which is 200 kB/s; any data point positioned to the left of this line signifies that performance is constrained by memory bandwidth. The compute-bound line (red) represents the maximum performance achievable by the FPGA's computational resources, limited to 0.053 TOPS; points below it indicate that the computational resources are underutilized. All data points (green, yellow, and purple) fall within the compute- and memory-bound regions, meaning none of them fully utilizes the available resources of ADAPTOR. The yellow square, representing the BERT model with \(TS_{MHA} = 64\) and \(TS_{FFN} = 192\), achieves the highest performance, being closest to the compute bound. In contrast, the purple star, representing the shallow transformer model with \(TS_{MHA} = 64\) and \(TS_{FFN} = 128\), has the highest operational intensity but the lowest performance.

Figure 30: Peak Performance and Peak Memory Bandwidth.


Table 1: Comparison with FPGA Accelerators.

| Accelerator | DSP | LUT | GOPS | Power (W) | GOPS/DSP (×1000) | GOPS/LUT (×1000) | GOPS/Power | Method | Sparsity |
|---|---|---|---|---|---|---|---|---|---|
| Network #1: Shallow Transformer | | | | | | | | | |
| Fang et al. [43] | 4160 (34%) | 464 k (27%) | 1467 | 27 | 353 | 3.16 | 13 | HDL | 75% |
| Qi et al. [19] | 3572 (52%) | 485 k (41%) | 14 | - | 3.92 | 0.03 | - | HLS | 80% |
| Qi et al. [33] | 5040 (74%) | 908 k (76%) | 12 | - | 2.38 | 0.013 | - | HLS | 86% |
| ADAPTOR | 3612 (40%) | 391 k (30%) | 27 | 11.8 | 7.47 | 0.069 | 2.28 | HLS | 0% |
| Network #2: Custom Transformer Encoder | | | | | | | | | |
| Qi et al. [33] | 4145 (60%) | 937 k (79%) | 75.94 | - | 18 | 0.08 | - | HLS | 0% |
| ADAPTOR | 3612 (40%) | 391 k (30%) | 132 | 11.8 | 37 | 0.34 | 11 | HLS | 0% |
| Network #3: BERT | | | | | | | | | |
| Ftrans [18] | 6531 (95%) | 451 k (38%) | 1053 | 25.06 | 161 | 2.33 | 42 | HLS | 93% |
| FQ-BERT [42] | 1751 (69%) | 123 k (45%) | 254 | 9.8 | 145 | 2.06 | 26 | - | 87% |
| Tzanos et al. [44] | 5861 (85%) | 910 k (77%) | 65.7 | - | 11.2 | 0.07 | - | - | 0% |
| TRAC [45] | 1379 (80%) | 126 k (55%) | 128 | - | 93 | 1.01 | - | - | - |
| ADAPTOR | 3612 (40%) | 391 k (30%) | 40 | 11.8 | 11 | 0.10 | 3.39 | HLS | 0% |

Table 1 compares the performance of our accelerator, ADAPTOR, with other FPGA-based accelerators. Each of these accelerators is optimized for specific TNN models, with some designed specifically for sparse computations. TRAC [45] is the only one that automatically generates accelerator code based on the target FPGA and TNN architecture. Since ADAPTOR was synthesized once with fixed hardware resources and bit width, and implemented on a different platform without sparsity, we evaluated metrics such as throughput (GOPS), power consumption, normalized throughput (GOPS per DSP or GOPS per LUT), and power efficiency (GOPS per watt) for a fair comparison. ADAPTOR achieved 1.9\(\times\) and 2.25\(\times\) higher GOPS than the accelerators by Qi et al. in [19] and [33], respectively, for a shallow transformer. Its normalized throughput was also higher, indicating more efficient DSP and LUT usage without relying on sparsity, whereas Qi et al. applied block-balanced pruning and a complex block-row storage format. Fang et al. [43] reported higher throughput, normalized throughput, and power efficiency than ADAPTOR, using HDL design techniques and a 75% sparse transformer. Qi et al.'s four-layer transformer encoder [33] was 1.7\(\times\) slower and 2\(\times\) less resource-efficient than ADAPTOR, although they applied hierarchical pruning. FTRANS [18], FQ-BERT [42], and TRAC [45] used various weight compression techniques to achieve very high GOPS, normalized throughput, and power efficiency for the BERT model: FTRANS used a block-circulant matrix-based weight representation, and FQ-BERT compressed BERT through integer and fixed-point quantization. However, since their implementations are not fully open source and are difficult to interpret, a direct comparison with ADAPTOR is challenging. It is unclear how FQ-BERT achieved such high throughput with low DSP and LUT utilization. FTRANS used more DSPs and LUTs than our design, but how this was managed in HLS without long compilation times and without tiling remains unclear. TRAC [45] also used fewer DSPs and LUTs yet reported 3.2\(\times\) higher GOPS and 8.4\(\times\) higher GOPS/DSP, which is puzzling. None of these accelerators applied tiling or described any partitioning scheme to accelerate large models such as BERT, which seems impractical. Tzanos et al.'s work [44] is more comparable to ADAPTOR, as they applied tiling and utilized more resources to achieve a speed 1.6\(\times\) faster, with a GOPS/DSP similar to ADAPTOR's.

Fig. 31 illustrates the variation in DSP consumption as the tile sizes of the MHA and FFN layers change. As DSP utilization increases, GOPS rises up to a certain point. However, as shown in Fig. 22, the frequency decreases beyond certain tile sizes. This results in a reduction of GOPS to 30 and 32 at 65% and 70% DSP utilization, respectively.

Figure 31: Effect of DSP on GOP/s for Various Tile Size Combinations.
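To make the trend concrete, the sketch below treats the reported throughput as total operations divided by wall-clock latency, where latency is the cycle count over the clock frequency. All numbers in it are made-up placeholders chosen only to show how a lower clock combined with little additional cycle savings pulls the measured GOPS down; they are not ADAPTOR measurements.

```python
# Illustrative only: measured GOPS = total operations / (cycles / f_clk).
# Larger tiles engage more DSPs and trim the cycle count, but past a point the
# savings flatten (memory stalls, tiles exceeding the matrix block size) while
# the achievable clock drops, so the measured throughput falls.

def measured_gops(total_ops, cycles, f_mhz):
    latency_s = cycles / (f_mhz * 1e6)
    return total_ops / latency_s / 1e9

# Placeholder workload of 2e7 operations:
print(measured_gops(2.0e7, cycles=100_000, f_mhz=200))  # smaller tiles, 200 MHz -> 40 GOPS
print(measured_gops(2.0e7, cycles=90_000,  f_mhz=135))  # larger tiles, 135 MHz -> 30 GOPS
```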


Table 2: Validation of Experimental and Analytical Results

| Method | Sequence Length | Embedding Dimension | Number of Heads | Tile Size (MHA) | Tile Size (FFN) | DSPs | 18k BRAMs | Frequency (MHz) | Attention Module (SA) Latency (ms) | Load Weights Unit (LWA) Latency (ms) | FFN Module (FFN1) Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Analytical | 64 | 768 | 8 | 64 | 128 | 3784 | 2375 | 200 | 0.052 | 0.037 | 0.082 |
| Experimental | 64 | 768 | 8 | 64 | 128 | 3612 | 2246 | 200 | 0.053 | 0.038 | 0.084 |
| Analytical | 128 | 768 | 8 | 64 | 128 | 3784 | 2375 | 200 | 0.103 | 0.037 | 0.165 |
| Experimental | 128 | 768 | 8 | 64 | 128 | 3612 | 2246 | 200 | 0.106 | 0.038 | 0.168 |
| Analytical | 64 | 512 | 8 | 64 | 128 | 3784 | 2375 | 200 | 0.042 | 0.025 | 0.055 |
| Experimental | 64 | 512 | 8 | 64 | 128 | 3612 | 2246 | 200 | 0.043 | 0.026 | 0.056 |
| Analytical | 64 | 768 | 8 | 128 | 192 | 6272 | 2955 | 135 | 0.11 | 0.1 | 0.18 |
| Experimental | 64 | 768 | 8 | 128 | 192 | 6317 | 1693 | 135 | 0.11 | 0.1 | 0.23 |

Table 2 compares the experimental data of ADAPTOR with the analytical results from Sec. 5. Only a few design configurations are shown, and for simplicity the data focus on the computation time of the attention and feedforward modules and on the weight-loading time of the attention module. Key parameters such as sequence length, embedding dimension, and number of heads influence latency. The experimental latency closely matched the analytical values, with only a 1.8% margin of error. Resource utilization remained consistent across configurations when the tile size was fixed. When the tile size changed, both the analytical and the experimental resource figures changed as well, with the measurements deviating from the estimates by 0.71–4.7% for DSPs and 5.7–74% for BRAMs. The larger BRAM deviation at bigger tile sizes occurred because the implementation mapped more buffers to LUTRAM instead of BRAM to sustain a high clock frequency.
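The deviation figures quoted above are relative errors with respect to the measured values. As a sanity check, the short sketch below recomputes a few of them from the first and last configurations in Table 2; the helper name is ours, not taken from the accelerator code.

```python
# Relative deviation of the analytical estimate from the measured value,
# as used for the percentages quoted in the text.
def deviation_pct(analytical, experimental):
    return abs(analytical - experimental) / experimental * 100

# First configuration in Table 2 (sequence 64, embedding 768, tiles 64/128):
print(deviation_pct(0.052, 0.053))   # attention-module latency
print(deviation_pct(3784, 3612))     # DSPs
# Last configuration (tiles 128/192):
print(deviation_pct(6272, 6317))     # DSPs  -> ~0.71%
print(deviation_pct(2955, 1693))     # BRAMs -> ~74%, the LUTRAM-driven outlier
```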


7 Conclusion↩︎

In this article, we present a runtime-adaptive FPGA-based accelerator for the encoder and decoder layers of transformer neural networks (TNN), designed using a high-level synthesis (HLS) tool. The architecture leverages FPGA parallelism as well as the inherent parallel nature of TNNs. We demonstrated its deployment on various FPGA platforms, including Alveo U55C, VC707, and ZCU102, highlighting how resources like DSPs and LUTs can be effectively utilized to maximize parallelism and minimize latency in HLS designs. The accelerator is software-programmable, enabling adaptability to different topologies without requiring new code generation or re-synthesis. We implemented an efficient tiling technique and data-loading method for weight matrices, ensuring portability and resource-efficient execution across different TNN models. Experimental results indicate that our design outperforms certain CPUs and GPUs in terms of dynamic power consumption and power efficiency, despite no algorithmic optimizations. Moreover, it achieved a 1.7 to 2.25× speedup over leading FPGA-based accelerators. An analytical model was also developed to validate the experimental findings.

8 Acknowledgments↩︎

This material is based upon work supported by the National Science Foundation under Grant No. 1956071.

References↩︎

[1]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[2]
K. Song, K. Wang, H. Yu, Y. Zhang, Z. Huang, W. Luo, X. Duan, and M. Zhang, “Alignment-enhanced transformer for constraining nmt with pre-specified translations,” in AAAI Conference on Artificial Intelligence, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:213842037.
[3]
T. Wang, L. Gong, C. Wang, Y. Yang, Y. Gao, X. Zhou, and H. Chen, "," vol. 41, no. 11, pp. 4088–4099, Nov. 2022. [Online]. Available: https://ieeexplore.ieee.org/document/9925700/.
[4]
K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder–decoder approaches," in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, D. Wu, M. Carpuat, X. Carreras, and E. M. Vecchi, Eds. Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103–111. [Online]. Available: https://aclanthology.org/W14-4012.
[5]
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, p. 1735–1780, nov 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735.
[6]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” ArXiv, vol. abs/2010.11929, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:225039882.
[7]
J.-B. Cordonnier, A. Loukas, and M. Jaggi, “On the relationship between self-attention and convolutional layers,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=HJlnC1rKPB.
[8]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[9]
J. J. Lin, R. Nogueira, and A. Yates, “Pretrained transformers for text ranking: Bert and beyond,” Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:222310837.
[10]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[11]
Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
[12]
W. Wang, B. Bi, M. Yan, C. Wu, Z. Bao, J. Xia, L. Peng, and L. Si, “Structbert: Incorporating language structures into pre-training for deep language understanding,” arXiv preprint arXiv:1908.04577, 2019.
[13]
H. Peng, S. Huang, S. Chen, B. Li, T. Geng, A. Li, W. Jiang, W. Wen, J. Bi, H. Liu, and C. Ding, "," San Francisco, California: ACM, Jul. 2022, pp. 1135–1140. [Online]. Available: https://dl.acm.org/doi/10.1145/3489517.3530585.
[14]
T. J. Ham, Y. Lee, S. H. Seo, S. Kim, H. Choi, S. J. Jung, and J. W. Lee, "ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks," in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Jun. 2021, pp. 692–705, ISSN: 2575-713X.
[15]
P. Rajpurkar, R. Jia, and P. Liang, "Know what you don't know: Unanswerable questions for SQuAD," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), I. Gurevych and Y. Miyao, Eds. Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 784–789. [Online]. Available: https://aclanthology.org/P18-2124.
[16]
S. Zeng, J. Liu, G. Dai, X. Yang, T. Fu, H. Wang, W. Ma, H. Sun, S. Li, Z. Huang, Y. Dai, J. Li, Z. Wang, R. Zhang, K. Wen, X. Ning, and Y. Wang, “Flightllm: Efficient large language model inference with a complete mapping flow on fpgas,” in Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, New York, USA, 2024. [Online]. Available: https://doi.org/10.1145/3626202.3637562.
[17]
P. Ganesh, Y. Chen, X. Lou, M. A. Khan, Y. Yang, H. Sajjad, P. Nakov, D. Chen, and M. Winslett, “Compressing large-scale transformer-based models: A case study on bert,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1061–1080, 2021.
[18]
B. Li, S. Pandey, H. Fang, Y. Lyv, J. Li, J. Chen, M. Xie, L. Wan, H. Liu, and C. Ding, "," Boston, Massachusetts: ACM, Aug. 2020, pp. 175–180. [Online]. Available: https://dl.acm.org/doi/10.1145/3370748.3406567.
[19]
P. Qi, Y. Song, H. Peng, S. Huang, Q. Zhuge, and E. H.-M. Sha, "," Virtual Event, USA: ACM, Jun. 2021, pp. 163–168. [Online]. Available: https://dl.acm.org/doi/10.1145/3453688.3461739.
[20]
K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, “[dl] a survey of fpga-based neural network inference accelerators,” ACM Trans. Reconfigurable Technol. Syst., vol. 12, no. 1, mar 2019. [Online]. Available: https://doi.org/10.1145/3289185.
[21]
M. Rognlien, Z. Que, J. G. F. Coutinho, and W. Luk, "," in L. Gan, Y. Wang, W. Xue, and T. Chau, Eds. Cham: Springer Nature Switzerland, 2022, vol. 13569, pp. 118–133, Series Title: Lecture Notes in Computer Science. [Online]. Available: https://link.springer.com/10.1007/978-3-031-19983-7_9.
[22]
S. Lu, M. Wang, S. Liang, J. Lin, and Z. Wang, "," Las Vegas, NV, USA: IEEE, Sep. 2020, pp. 84–89. [Online]. Available: https://ieeexplore.ieee.org/document/9524802/.
[23]
T. J. Ham, S. Jung, S. Kim, Y. H. Oh, Y. Park, Y. Song, J.-H. Park, S. Lee, K. Park, J. W. Lee, and D.-K. Jeong, “A3: Accelerating attention mechanisms in neural networks with approximation,” 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 328–341, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:211296403.
[24]
X. Zhang, Y. Wu, P. Zhou, X. Tang, and J. Hu, "," vol. 20, no. 5s, pp. 1–24, Oct. 2021. [Online]. Available: https://dl.acm.org/doi/10.1145/3477002.
[25]
S. Hur, S. Na, D. Kwon, J. Kim, A. Boutros, E. Nurvitadhi, and J. Kim, “A fast and flexible fpga-based accelerator for natural language processing neural networks,” ACM Trans. Archit. Code Optim., vol. 20, no. 1, feb 2023. [Online]. Available: https://doi.org/10.1145/3564606.
[26]
H. Peng, S. Huang, T. Geng, A. Li, W. Jiang, H. Liu, S. Wang, and C. Ding, "," Santa Clara, CA, USA: IEEE, Apr. 2021, pp. 142–148. [Online]. Available: https://ieeexplore.ieee.org/document/9424344/.
[27]
Z. Jiang, D. Yin, E. E. Khoda, V. Loncar, E. Govorkova, E. Moreno, P. Harris, S. Hauck, and S.-C. Hsu, "."
[28]
F. Wojcicki, Z. Que, A. D. Tapper, and W. Luk, "," Hong Kong: IEEE, Dec. 2022, pp. 1–8. [Online]. Available: https://ieeexplore.ieee.org/document/9974463/.
[29]
Y. Chen, T. Li, X. Chen, Z. Cai, and T. Su, "," vol. 12, no. 4, p. 822, Jan. 2023, Publisher: Multidisciplinary Digital Publishing Institute. [Online]. Available: https://www.mdpi.com/2079-9292/12/4/822.
[30]
X. Yang and T. Su, "," vol. 11, no. 21, p. 3550, Oct. 2022. [Online]. Available: https://www.mdpi.com/2079-9292/11/21/3550.
[31]
Y. Bai and F. University, “.”
[32]
E. Kabir, D. Coble, J. N. Satme, A. R. Downey, J. D. Bakos, D. Andrews, and M. Huang, “Accelerating lstm-based high-rate dynamic system models,” in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), 2023, pp. 327–332.
[33]
P. Qi, E. H.-M. Sha, Q. Zhuge, H. Peng, S. Huang, Z. Kong, Y. Song, and B. Li, "," Munich, Germany: IEEE, Nov. 2021, pp. 1–9. [Online]. Available: https://ieeexplore.ieee.org/document/9643586/.
[34]
H. Ye, C. Hao, J. Cheng, H. Jeong, J. Huang, S. Neuendorffer, and D. Chen, “Scalehls: A new scalable high-level synthesis framework on multi-level intermediate representation,” 2021. [Online]. Available: https://arxiv.org/abs/2107.11673.
[35]
"BERT — huggingface.co," https://huggingface.co/docs/transformers/en/model_doc/bert, [Accessed 15-09-2024].
[36]
"AMD Technical Information Portal — docs.amd.com," https://docs.amd.com/r/en-US/ug1399-vitis-hls/AXI4-Master-Interface, [Accessed 21-07-2024].
[37]
"DMA/Bridge Subsystem for PCI Express Product Guide (PG195)." [Online]. Available: https://docs.xilinx.com/r/en-US/pg195-pcie-dma.
[38]
"AMD Technical Information Portal — docs.amd.com," https://docs.amd.com/v/u/en-US/axi_timer_ds764, [Accessed 15-07-2024].
[39]
"Programmers — digilent.com," https://digilent.com/shop/fpga-boards/programmers/, [Accessed 15-07-2024].
[40]
"AMD Technical Information Portal — docs.amd.com," https://docs.amd.com/v/u/en-US/axi_uartlite_ds741, [Accessed 15-07-2024].
[41]
Y. Han and T. University, “.”
[42]
Z. Liu, G. Li, and J. Cheng, "," Grenoble, France: IEEE, Feb. 2021, pp. 513–516. [Online]. Available: https://ieeexplore.ieee.org/document/9474043/.
[43]
C. Fang, A. Zhou, and Z. Wang, "," vol. 30, no. 11, pp. 1573–1586, Nov. 2022, arXiv:2208.06118 [cs]. [Online]. Available: http://arxiv.org/abs/2208.06118.
[44]
G. Tzanos, C. Kachris, and D. Soudris, "," Tripolis, Greece: IEEE, Dec. 2022, pp. 1–5. [Online]. Available: https://ieeexplore.ieee.org/document/9976354/.
[45]
P. Plagwitz, F. Hannig, and J. Teich, "," Belfast, United Kingdom: IEEE, Aug. 2022, pp. 17–23. [Online]. Available: https://ieeexplore.ieee.org/document/10035242/.

  1. https://github.com/Kabir-Ehsan/Transformer_on_FPGA↩︎