Addressing Architectural Obstacles for Overlay with Stream Network Abstraction


Abstract

Overlay is an effective approach for creating FPGA-based AI accelerators, enabling software-programmable specialized hardware datapaths to flexibly support various DNN operations. Traditional DNN overlays typically base their instruction sets on the von Neumann model, adapted to be more coarse-grained. These instruction sets control execution at the layer granularity and impose restricted patterns for mapping computation and bandwidth resources. Such constraints cause inefficiencies from the imperfect match between the supported execution patterns and diverse DNN layer shapes and types.

This work proposes the Reconfigurable Stream Network architecture, a unique ISA abstraction tailored for flexible FPGA overlay execution at low cost, making it the first known FPGA design to support dynamic sequential linear layer pipelining. This architecture presents a datapath abstraction modeled after a specialized circuit-switched network with stateful functional units (FUs) as nodes and data streaming on edges. Programming a computation corresponds to triggering a network path in this stream-connected datapath. The program can individually control FUs to form paths that exploit both spatial and pipeline parallelism between independent and dependent concurrent computations. We present a proof-of-concept design, RSN-XNN, on the Versal VCK190. Evaluations show a 22x latency reduction for BERT compared to the state of the art, along with throughput improvements of 3.2x, 2.4x, 2.5x, and 2.8x for BERT, VIT, NCF, and MLP, respectively. RSN-XNN matches the latency of the T4 GPU with the same FP32 performance but only 18% of the memory bandwidth. Compared to the A100 GPU under the same 7nm process node, it achieves 2.1x/4.5x better operating/dynamic energy efficiency in FP32.

Architecture, Dataflow, Overlay, Parallelism, Streaming, FPGA, Versal ACAP, HLS, Transformer

1 Introduction

Artificial intelligence is revolutionizing computing in areas such as language processing and computer vision, driving dramatic demands for high performance and energy efficiency. While GPUs have seen great success in AI, some critics argue that their general-purpose hardware support introduces unnecessary overheads, motivating the emergence of ASIC-based AI accelerators such as TPU [1] and Groq [2]. Although ASICs offer good efficiency, they require lengthy chip development cycles; therefore, FPGA-based accelerators are increasingly being studied because they allow hardware datapaths to be reconfigured without taping out a new chip.

There are two popular styles for building FPGA-based AI accelerators: fixed function [3][11] and overlay [12][24]. Fixed-function designs are efficient when flexible datapath reuse is unnecessary, but become inefficient when temporal reuse of the datapath is complex and not completely clear at the outset, as they must either integrate reuse logic directly into the datapaths or suffer from sub-second bitstream reconfiguration latency. DNN overlays reintroduce the “stored program” concept into FPGAs’ native dataflow execution by using instructions to control temporal datapath reuse. Many overlays envision leveraging the efficiency of datapath customization and CPU-like software programmability at the same time [12], [13], [19]. However, despite this ambition, the reality is that current DNN overlays often have more restricted inter-layer execution flexibility than fixed-function designs, as discussed in Section 2.2. According to our diagnosis, traditional overlays typically base their architecture on the von Neumann model and reduce implementation costs by controlling execution at the DNN layer granularity instead of at one or multiple data granularities [12], [14][16], [20], [21], [23]. This approach inherently serializes the execution of layers and significantly restricts the execution patterns for mapping computation and bandwidth resources. For example, most overlays only allow mapping one linear layer to their datapaths at a time [14][16], [20], [21], [23]. Such constraints can cause inefficiencies when the execution patterns cannot perfectly meet the diverse needs of different layers. For example, reusing the entire datapath to map one small layer may underutilize computing resources.

Motivated by observed constraints, we ask the research question: What is the right architecture that bridges software and hardware for a DNN overlay? Ideally, the desired abstraction should address two fundamental challenges:

  • Flexibility: Enable flexible execution patterns for mapping computation and bandwidth resources. Within a layer, phases such as prolog, steady-state, and epilog require different controls. Across layers, different operation types and shapes demand varied mapping patterns.

  • Heterogeneity: Modern FPGAs feature highly heterogeneous resources, such as BRAMs, DSPs, and AI engines, which require a unified abstraction to effectively manage this diversity and suit the dataflow nature of FPGAs. These resources are synchronized at the nanosecond level, needing precise coordination. However, using instructions at one or multiple data granularities, as in CPUs, is inefficient due to FPGAs’ much lower operating frequency.

Despite these challenges, DNNs present unique application opportunities. While they are compute/memory-intensive, the information entropy of their controls is very small relative to the data processed in the model, and each layer involves repetitive computations with relatively consistent control patterns. These properties inspire architectural innovations that provide flexible instruction-to-data granularity, bridging between coarse layer-level and fine data-level granularity while keeping costs low. Furthermore, the deterministic nature of DNN execution allows for compile-time analysis of data dependencies, eliminating the need for runtime discovery. On the hardware side, FPGAs excel in stream processing and customized datapath design, enabling efficient data communication through massively parallel streams. FPGAs’ reconfigurability supports customized instruction set architectures (ISAs) that expose only the necessary controls for datapath reuse to improve simplicity.

Figure 1: Reconfigurable Stream Network Overview

Driven by application and hardware insights, we propose a reconfigurable stream network architecture, in which the datapath is abstracted as a specialized circuit-switched network of stateful functional units (FUs), as shown in Fig. 1. Conceptually, programming a computation corresponds to triggering a circuit path in the network, with data sourced from off-chip, streamed through FUs, and then sunk back to off-chip. Multiple non-conflicting paths can be established to utilize available FUs for spatial parallelism between data-independent computations. The output of one path can feed the input of another path for pipeline parallelism between data-dependent computations. Software can control each FU individually to flexibly form execution paths, and these paths can be fully or partially reprogrammed to switch between different execution patterns, thereby allowing flexible resource mapping.

The contributions of this work can be summarized as:

  • Propose Reconfigurable Stream Network (RSN) architecture, detailing its principles and implementation.

  • Identify the architectural bottlenecks in existing overlays and develop the first known FPGA design to achieve dynamic sequential linear layer pipelining.

  • Build RSN-XNN on VCK190 and achieve a 22x latency reduction for BERT and 2.4x-3.2x better throughput for BERT, VIT, NCF, and MLP compared to [6].

  • Achieve the highest reported GEMM throughput on the VCK190.

  • Present a quantitative comparison with T4, V100, A100, and L4 GPUs that shows the need for FPGAs to continue integrating ASICs for high performance and bandwidth.

2 Background and Motivation

2.1 AI Hardware and Versal Architecture

The extreme computing demands of AI are driving trends to enhance hardware with AI processing capabilities or invent AI-specialized hardware. For instance, CPUs like Intel’s Xeon include a vector neural network instruction set for efficient SIMD processing [25], and GPUs like NVIDIA’s A100 enhance matrix computing capabilities with tensor cores [26]. However, advocates for domain-specific hardware point out the unnecessary area and energy overheads in CPUs/GPUs due to general-purpose hardware features like caches and complex dependency controls [1]. Google TPUs simplify their microarchitectures by removing caches and branch predictors, using a deterministic execution model and software-managed scratchpads [1]. Groq’s AI processor uses a software-scheduled network for deterministic communication among tensor streaming processors [2]. While ASIC-based accelerators require a lengthy chip development cycle, FPGA-based accelerators offer a faster “time-to-solution” as they can reconfigure datapaths without a new tape-out. However, traditional FPGAs have low multiply-accumulate performance, mainly delivered by DSPs designed for traditional signal processing tasks [27]. To improve this, FPGA vendors are combining coarse-grained computing units with standard FPGA fabrics to exploit the high frequency, energy efficiency, and unit density of ASICs. For example, AMD’s Versal ACAP combines the FPGA fabric with a processor array [28], and Intel’s Stratix NX integrates hardened tensor blocks directly into the FPGA fabric [29].

Figure 2: Versal ACAP VCK190 Block Diagram

Table 1: Inter-Linear Layer Execution Flexibility Comparison of DNN Accelerators
Supported Execution Features NPU, etc. [13][24] DLA [12] HPIPE, etc. [3][5] CHARM, etc. [6], [7] TGPA, etc. [8][11] RSN-XNN (this work) Tiled ASICs [30][36]
Software programmable \(\times\) \(\times\) \(\times\)
Exactly design one FU per layer \(\times\) \(\times\) \(\times\) \(\times\) \(\times\) \(\times\)
Allocate FUs based on layer shape \(\times\) \(\times\)
Spatially execute different layers \(\times\) \(\times\)
Temporally interleave layers \(\times\) \(\times\) \(\times\) \(\times\) \(\times\)
Sequential linear layer pipeline \(\times\) \(\times\) \(\times\)
Dynamic chain of pipelined FUs \(\times\) \(\times\) \(\times\) \(\times\) \(\times\)
Fine-grained off-chip BW mapping \(\times\) \(\times\) \(\times\) \(\times\)
The second to seventh columns target FPGA platforms, while the last column targets a tiled ASIC platform. The term ‘layer’ refers to a linear layer.

This work is built on the Versal VCK190 evaluation kit [37], the first FPGA kit with AI engines. Fig. 2 (a) depicts its system-level block diagram, which contains the processing system (PS), the programmable logic (PL), the AIE array, off-chip memories, and on-chip networks. The PS contains ARM CPUs, and the PL contains traditional FPGA resources such as LUTs, FFs, DSPs, and on-chip RAMs (4MB of BRAMs and 16MB of URAMs). The AIE array has 8 rows by 50 columns of AIE tiles, providing a peak throughput of 8 TFLOP/s for FP32. Each AIE tile has a 1.25 GHz 7-way VLIW processor and a 32KB local software-managed scratchpad memory. Moreover, the kit comes with one 8GB DDR4 and one 8GB LPDDR4, with peak off-chip bandwidths of 25.6 GB/s and 32 GB/s, respectively.
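As a quick consistency check, the quoted 8 TFLOP/s FP32 peak follows from the array size, clock rate, and per-tile vector width, assuming each AIE tile sustains 8 FP32 multiply-accumulates (16 FLOPs) per cycle:

\[
400\ \text{tiles} \times 16\ \tfrac{\text{FLOP}}{\text{tile}\cdot\text{cycle}} \times 1.25\ \text{GHz} = 8\ \text{TFLOP/s}.
\]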

2.2 Inter-Layer Execution for FPGA-based Accelerators

This section compares inter-layer execution patterns on FPGA-based accelerators. There are three common types. The generic reusable FU type uses a large, reusable FU to sequentially execute all layers [12][24], which often requires off-chip buffering of intermediate feature maps due to limited on-chip memory. To reduce this off-chip traffic, Intel’s DLA [12] executes a generic FU at a tile granularity, processing a tile in one layer and immediately using it as input for a tile in the next layer without off-chip data movement. However, this is inefficient for DNN models with small layers that cannot be sufficiently unrolled to utilize all computing resources. The fully pipelined type allocates a customized FU for each layer, allowing direct forwarding of intermediate feature maps on chip between pipelined FUs. This reduces off-chip data accesses and improves customized use of resources [3][5]. However, deep models may not fit because the layer-dedicated buffers may exceed the on-chip capacity, and a longer end-to-end latency may result from pipeline setup. Using smaller tiles can reduce pipeline setup latency but increases weight accesses, since each tile requires access to the entire layer’s weights. HPIPE [4] places all weights on-chip to avoid off-chip weight accesses, but this limits model size. The multiple reusable FUs type balances between the previous two types by creating multiple customized FUs and strategically assigning layers to them. DNNExplorer uses individual FUs for initial layers and a generic FU for later layers [7], while CHARM uses separate FUs for large and small layers [6]. Other designs use a static chain of pipelined FUs to execute one segment of a DNN at a time, with SSR [9] and [11] at a batch granularity, and TGPA [8] and [10] at a tile granularity. Although the number and size of FUs can be strategically explored, static pipelined datapaths inherently lead to mismatches and reduced performance when faced with diverse layers.

Table 1 compares inter-layer execution flexibility across different FPGA or ASIC-based DNN accelerators, including features such as FU allocation customization and spatial/temporal execution of different layers. The second and third columns reveal that current FPGA DNN overlays are surprisingly restricted. Current overlays typically adopt a generic reusable FU approach. Most of them execute one layer at a time [13][24], while DLA executes one tile at a time and features flexible stream buffers to manage data movement between off-chip and on-chip memory [12]. Fixed-function designs provide greater inter-layer flexibility; however, each design only offers partial flexibility and seldom supports fine-grained off-chip bandwidth mapping, like delayed data eviction [3][11]. The last column lists works on ASIC-based accelerators targeting tiled FUs interconnected via networks-on-chip (NoCs) [30][36]. These studies primarily focus on scheduling policies and rely on simulation-based experiments, often overlooking practical realization challenges by assuming that the computation cores can flexibly route data following the specified schedules, as seen in their codebases [38][41]. This table shows that current FPGA designs exhibit more restricted inter-layer flexibility compared to the flexibility assumed in ASIC-based studies, despite FPGAs allowing datapath customization. Is this because FPGA designs intentionally exclude unnecessary features? We believe this is sometimes the case. For example, our RSN-XNN design deliberately omits temporal layer interleaving because it does not benefit the targeted applications. However, a core issue is that implementing an ASIC-like software stack on FPGAs is not cost-effective due to their lower frequency compared to ASICs, and current overlays suffer from serialization at the layer granularity, as further discussed in Section 2.3.

Therefore, the primary challenge we address is to create an abstraction that enables overlays to achieve the necessary flexibility at a low instruction cost and area overhead. With the RSN architecture, RSN-XNN has the highest inter-layer flexibility among FPGA designs, close to the levels typically assumed in ASIC studies. It is the first known FPGA design that enables dynamic sequential linear layer fusion and supports fine-grained bandwidth mapping, including load/store interleaving and overlapping prolog/epilog of consecutive layers.

2.3 Overlay and Dataflow Architecture

Current DNN overlays primarily employ two styles of instruction set architectures: VLIW-like and RISC-like. The VLIW-like style, as adopted in Intel’s DLA [12] and NPU [15], and other works [16], [20], [23], exploits massive parallelism by executing multiple FUs synchronously under a single wide instruction stream. For instance, Intel’s NPU uses VLIW instructions of five macro-operations, each directing a different stage of the datapath, including a matrix-vector multiplication unit, a vector register file, two multi-function units, and a loader unit [15]. These macro-operations are decoded into sequences of micro-operations and issued to the pipeline stages. Alternatively, many works adopt a RISC-like instruction set, where each customized instruction maps to a straightforward operation [13], [14], [18], [19], [21], [22]. These operations can be broadly categorized into three types: 1) control: synchronize with the host and schedule workloads to functional units; 2) data movement: load and store data from and to off-chip memory; and 3) compute: execute DNN operations in specific modes such as sparse matrix multiplications, convolutional layers, or activation layers. For instance, Microsoft’s Brainwave uses a single-threaded, in-order model [14], where a chain of dependent instructions, such as vector read/write and matrix-vector multiply, controls the execution of a single layer. While the previous two ISA styles dominate, FGPU [42] builds a GPU-like soft single-instruction, multiple-threads processor; however, this more general approach is rarely adopted in modern DNN overlays as it does not fully exploit the opportunity for datapath customization.

In practice, current overlays base their ISAs on the von Neumann model, in which a computer is logically composed of a central processing unit (containing an arithmetic logic unit and registers), a memory, and input/output interfaces. In this model, the arithmetic logic unit corresponds to the overlays’ matrix engines and miscellaneous engines (typically for operations like pooling, ReLU, etc.), while registers correspond to the overlays’ on-chip register files or buffers. Similarly, the logical memory typically corresponds to off-chip device memory. As instructions are executed sequentially, this model inherently serializes execution relative to the granularity of instructions. To reduce the cost of control units, overlays predominantly design their instructions to control execution at the layer granularity [13][24]; however, this approach inherently serializes inter-layer execution and significantly restricts the execution patterns. In contrast, the RSN architecture conceptualizes the datapath as a network of stateful functional units, rather than keeping the program state in the logical memory or register files. Moreover, our architecture achieves flexible instruction-to-data granularity, ranging from the layer level for a low instruction cost to the multi-data level for precise control.

The dataflow architecture community has provided significant inspiration as we move away from the von Neumann model. The decoupled access/execute architecture [43] decouples operand access and execution through two separate instruction streams, and also proposes to conceptually merge these streams for easier programming and compilation. Stream dataflow [44][46] abstracts computation as a CGRA-mapped dataflow graph and data movement with streams and barriers. OverGen [47] extends stream dataflow tool-chains to generate non-DNN overlays controlled by soft RISC-V cores. Triggered instructions [48] remove the program counter and integrate architectural state registers into FUs, allowing each FU to control itself and react quickly to input channel messages. WaveScalar [49] is another dataflow processor without a program counter and has stateless FUs that immediately output results once triggered. DySER [50] integrates a circuit-switched network of stateless FUs into the execution stages of traditional processor pipelines. These studies do not directly apply to DNN overlays as they target different hardware or applications.

Figure 3: Functional Unit and Datapath Abstraction

3 Reconfigurable Stream Network Arch

3.1 Abstraction

Components: A reconfigurable stream network hardware consists of a datapath and an instruction decoder that controls it, with the datapath abstracted as a specialized circuit-switched network of stateful functional units. The data is sourced from off-chip, streamed and transformed through FUs, and ultimately sunk back off-chip. An FU comprises a micro-operation (uOP) decoder, input and output ports, and customized modules designed to transform and hold states, as depicted in Fig. 3. A uOP decoder is the interface that makes an FU controllable by software according to the uOP sequence received from the instruction decoder, which directs the method and amount of data to be transformed and communicated (control plane). Ports include streams used for data communication between nodes, allowing the transmission of a continuous sequence of data from one source FU to another destination FU (data plane). This communication is latency-insensitive, meaning that the correctness of execution does not depend on timing, and the FUs are stallable. Off-chip memories usually serve as the source and sink for the datapath through memory abstractions. Modules customized with state transformations (e.g., arithmetic operations and layout transformations) manipulate and process data, while state holders (e.g., buffers, registers, and FSMs) preserve both data and the associated execution states.

Execution Model: A computation is executed by triggering a circuit path within the FU network. This activates the designated FUs to perform the necessary transformations and manage communication along the interconnected edges. The state of a program is maintained within these FUs, and the actions of one FU are abstracted as a sequence of kernels, with each kernel representing an atomic step in transforming the FU state. The control plane of the kernels is derived from the uOPs, and each uOP triggers a single execution of the kernel. Each FU has its own sequence of uOPs and can only process one kernel at a time. Once a kernel execution is complete, the FU continuously fetches the next uOP from its attached uOP queue and stalls if no further uOPs are available.
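A minimal behavioral sketch of this per-FU control loop is shown below (plain C++; the `uOP` layout and the `fu_main`/`run_kernel` names are illustrative, not the RSN-XNN source). The FU fetches one uOP at a time, runs exactly one kernel per uOP, and stalls whenever its uOP queue is empty.

```cpp
#include <queue>

struct uOP { int opcode; int args[4]; };     // control plane for one kernel run
constexpr int UOP_EXIT = -1;                 // models the 'last' exit signal

// Behavioral model of one stateful FU: fetch a uOP, run one kernel, repeat.
// In hardware the uOP queue and data ports are latency-insensitive FIFOs,
// so an empty queue simply stalls the FU instead of busy-waiting.
template <typename KernelFn>
void fu_main(std::queue<uOP>& uop_queue, KernelFn run_kernel) {
  while (true) {
    if (uop_queue.empty()) continue;         // stall: no uOP available yet
    uOP u = uop_queue.front();
    uop_queue.pop();
    if (u.opcode == UOP_EXIT) break;         // FU exits on the 'last' signal
    run_kernel(u);                           // one atomic state transformation
  }
}
```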

Figure 4: Example of FU Kernels and Mapping Applications to FUs

A program specifies a set of kernels and the data dependencies among them, expressed through the kernels’ input and output streams. For every piece of data sent by an FU’s output port, the destination FU must have a receive operation at its corresponding input port. The number of sends from the producer kernel must exactly match the number of receives in the consumer kernels. If the sends are fewer than the receives, the receiving kernel will block indefinitely; if the sends exceed the receives, the producer kernel will block once the stream channel is full.

Fig. 4 shows an example datapath comprising three FUs and the mapping of two applications onto this datapath. FU1 can read \(N\) data from the source in at address addr and write to FU2 or FU3. FU2 can increment data from FU1 by 1 and forward it to FU3. FU3 can store \(N\) data from FU1 or FU2 to the sink out at addr. This datapath has three streams: FU1\(\rightarrow\)FU2, FU1\(\rightarrow\)FU3, and FU2\(\rightarrow\)FU3. To execute Application 1, which reads 100 input elements from the source in, increments each by 1, and stores them in the sink out, each FU operates with its own uOP sequence containing a single uOP. For instance, FU1’s uOP directs it to read 100 elements from address 0 and forward them to FU2, as commanded by passing the values (FU2, 100, 0) to the control plane. Application 2 is more complex, selectively incrementing data at addresses 0-99 and 200-299, while directly copying data at addresses 100-199. To execute this, FU2 still requires only one uOP, but with the data length increased from 100 to 200. Both FU1 and FU3 need three uOPs each to reconfigure the destination or source FUs. These examples demonstrate how the triggered paths can be reprogrammed flexibly to accommodate different applications, as well as the FU-level pipeline parallelism achieved by streams.
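To make the uOP view concrete, the sequences for Application 2 could be written down roughly as follows (a sketch; the struct layouts are hypothetical, not the actual RSN-XNN encodings):

```cpp
// Hypothetical uOP encodings for the Fig. 4 datapath, Application 2.
// FU1: read n elements at addr from source `in` and forward them to dest.
// FU2: increment n elements received from FU1 and forward them to FU3.
// FU3: store n elements received from src to addr of sink `out`.
enum FUId { FU1 = 1, FU2 = 2, FU3 = 3 };
struct FU1_uOP { FUId dest; int n; int addr; };
struct FU2_uOP { int n; };
struct FU3_uOP { FUId src; int n; int addr; };

const FU1_uOP fu1_prog[] = { {FU2, 100,   0},     // 0-99:    increment path
                             {FU3, 100, 100},     // 100-199: direct copy
                             {FU2, 100, 200} };   // 200-299: increment path
const FU2_uOP fu2_prog[] = { {200} };             // one uOP covers both increment chunks
const FU3_uOP fu3_prog[] = { {FU2, 100,   0},     // receive incremented 0-99
                             {FU1, 100, 100},     // receive copied 100-199
                             {FU2, 100, 200} };   // receive incremented 200-299
```

The send/receive counts match as required: FU1 sends 200 elements to FU2 and 100 to FU3, FU2 forwards 200 to FU3, and FU3 receives exactly those amounts across its three uOPs.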

With the abstraction of the FU network, multiple paths can be triggered in the datapath at the same time. Multiple non-conflicting paths can be established to exploit spatial parallelism by triggering separate FUs for data-independent computations, while the output of one path can feed into another path to allow for pipeline parallelism in data-dependent computations. Fig. 5 exemplifies a flexible datapath that can either use all compute resources to execute a single layer or pipeline two sequential layers without sending intermediate data off-chip. If mapping one layer at a time, two independent paths are triggered: Path 1 uses compute1 FU, and Path 2 uses compute2 FU. When mapping two pipelined layers, Path 3 executes layer 1 by fetching input from the source and stops after saving the output of layer 1 into buffer1 FU. Subsequently, Path 4 begins by retrieving the output of Path 3 as the input for layer 2, uses compute2 FU for computation, and then stores the data back into the sink. This example shows how to program different execution patterns based on the needs of DNN layers. Additionally, we can observe that the behavior of compute1 FU and compute2 FU remains consistent regardless of the mapping styles. This indicates that when transitioning between mapping styles, there is no need to reprogram the entire datapath. We can control only buffer1, mesh1, and buffer2 FUs for partial path reprogramming to simplify instruction issuing and decoding costs.
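A rough sketch of how the two mapping styles in Fig. 5 differ only in a few routing uOPs (identifiers, fields, and the tile count are illustrative):

```cpp
// Hypothetical routing uOPs for the Fig. 5 datapath. Only the routing FUs
// (buffers/mesh) receive different uOPs across mapping styles; compute1 and
// compute2 run the same kernels in both cases.
enum NodeId { SOURCE, COMPUTE1, COMPUTE2, BUFFER1, SINK };
struct RouteUOP { NodeId src; NodeId dst; int tiles; };
constexpr int T = 64;                                  // tiles per layer (illustrative)

// Single-layer mapping: both compute FUs work on one layer through two
// independent, non-conflicting paths.
const RouteUOP single_layer[] = {
  {SOURCE,   COMPUTE1, T / 2}, {COMPUTE1, SINK,    T / 2},   // Path 1
  {SOURCE,   COMPUTE2, T / 2}, {COMPUTE2, SINK,    T / 2},   // Path 2
};

// Pipelined mapping of two sequential layers: layer 1's output is parked in
// buffer1 (Path 3) and then replayed as layer 2's input (Path 4), so the
// intermediate tensor never leaves the chip.
const RouteUOP pipelined[] = {
  {SOURCE,   COMPUTE1, T}, {COMPUTE1, BUFFER1, T},           // Path 3: layer 1
  {BUFFER1,  COMPUTE2, T}, {COMPUTE2, SINK,    T},           // Path 4: layer 2
};
```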

Figure 5: A Flexible Datapath Enabling Dynamic Pipelining

3.2 Bridging DNN Applications and FPGA Hardware

With the RSN abstraction presented, this section explains its benefits in bridging DNN applications and FPGA hardware.

Flexible execution patterns: To execute a DNN model, each layer typically involves distinct phases, such as prolog, steady-state, and epilog. When the steady-state is small, precise control over datapath behavior during the non-steady states becomes crucial. Across layers, the diversity in operation types and shapes necessitates flexible inter-layer computation resource mapping. The RSN abstraction facilitates precise control through flexibly reprogrammable datapaths.

Inherent parallelism: DNN models have massive concurrency. In the von Neumann architecture, program states are stored in logical memory and register files, with instructions executed sequentially. This inherently limits parallelism due to the limited number of ports in the memory and register files. In contrast, the number of data ports in an RSN datapath scales with the parallelism it can provide, as states are maintained within the FUs and distributed across the FU network.

Heterogeneity and customization: Modern FPGAs feature highly heterogeneous resources, such as BRAMs, DSPs, and AI engines. Abstracting the actions that an FU can perform through kernels provides a unified way to manage this diversity and isolates the actual FU microarchitectures from the software. Modifications to an FU’s actions impact only its neighboring FUs, preserving the integrity of other connections. Moreover, abstracting FUs as network nodes aligns with FPGAs’ dataflow nature, allowing each node to perform customized tasks individually. Network edges provide exclusive data communication paths, harmonizing with the FPGAs’ capability for datapath customization.

Low instruction cost: Although efficient DNN execution requires precise control, the total information entropy of the control for one DNN is very small compared to the data processed, and layers often go through phases with repetitive computations. The RSN architecture leverages this property to reduce control costs by supporting flexible instruction-to-data granularity and partial path reprogramming, as different FUs and compute phases demand different levels of control effort. Additionally, data is locally synchronized between producer and consumer FUs through streams at the edges, avoiding centralized data dependency management, as data is not carried by the arrival of instructions.

Determinism: The RSN execution model does not support prediction or speculative execution. While these features could potentially be added in the future, we justify the current choice by noting that DNN executions are generally deterministic. The order of execution and data dependencies are known at compile time, and there is no need for runtime discovery. Runtime uncertainties in the system, such as variable DRAM latency and NoC transfer latency, are handled by latency-insensitive streaming protocols.

Figure 6: Instruction Decoder: Fuse uOP Streams into 1 RSN Instruction Stream

3.3 Instruction Decoder

While the concept of providing one instruction stream per FU is straightforward, it suffers from duplicated instruction fetch/decode units and increased programming complexity. Similar to the solution in the decoupled access/execute architecture [43], we address this issue by physically merging multiple instruction streams into a single stream. As illustrated in Fig. 6, the program is stored as a single sequence of RSN instructions, and the instruction decoder issues uOP sequences to the FUs. Instead of merely interleaving different uOP sequences into a single RSN sequence, we introduce an intermediate level of decoding to enable RSN instruction reuse and improve code efficiency. In full: the top-level decoder receives an RSN instruction sequence consisting of UDP-like instruction packets, each with a 32-bit header and a payload section. The header includes: 1) opcode: the FU type; 2) mask: selects the targeted FUs; 3) last: signals FU exit; 4) window size: the number of RSN instructions in this packet; 5) reuse: how many times this packet will be reused. The decoder converts the payload into macro-operations (mOPs) and sends them to the second-level decoders, with destinations specified by opcode and mask and the number of mOPs specified by window size. Second-level decoders enable instruction packet reuse to improve code efficiency. A decoder first reads the header to determine the window size and reuse count of an instruction packet. It then processes the specified window of mOPs, converts them into uOPs, stores them locally, and forwards them to the third-level decoder repeatedly for the specified reuse count. This mechanism addresses common scenarios where a small sequence of uOPs is repeated, such as when an FU needs to send data to FU1 and then FU2, repeating the process 128 times; in this case, an instruction packet with a window size of 2 and a reuse of 128 can be created. FPGA reconfigurability allows further customization; for example, we add stride size and stride count to some FUs to support strided off-chip accesses. Third-level decoders are attached to the FUs and translate uOPs to control kernel execution.
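A sketch of the 32-bit packet header and of the replay performed by a second-level decoder is given below; the five header fields are those named above, while the individual bit widths and the helper names are assumptions for illustration.

```cpp
#include <cstdint>
#include <vector>

// 32-bit RSN instruction packet header. The field set matches the text; the
// bit widths chosen here are illustrative assumptions (they sum to 32).
struct RSNHeader {
  uint32_t opcode      : 4;   // FU type targeted by this packet
  uint32_t mask        : 8;   // selects which FUs of that type are targeted
  uint32_t last        : 1;   // signals FU exit
  uint32_t window_size : 9;   // number of RSN instructions in the payload
  uint32_t reuse       : 10;  // how many times the payload window is replayed
};

// Second-level decoder behavior: buffer a window of `window_size` mOPs
// locally, then forward the whole window to the third-level decoder
// `reuse` times (e.g., window_size = 2, reuse = 128 in the example above).
template <typename mOP, typename EmitFn>
void replay_packet(const RSNHeader& h, const std::vector<mOP>& window,
                   EmitFn emit_uop) {
  for (uint32_t r = 0; r < h.reuse; ++r)
    for (uint32_t i = 0; i < h.window_size; ++i)
      emit_uop(window[i]);            // translate/forward one mOP as a uOP
}
```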

Figure 7: RSN vs Translated uOP Size for Different FU Types

Decoding efficiency: Fig. 7 compares the size of RSN instructions needed to execute one BERT-Large encoder in our design with the size of the translated uOPs, categorized by FU types. This figure shows the diversity of controls required for different FU types. In terms of uOP size, FUs (LPDDR, DDR) that interact with off-chip memory require more controls than FUs (MeshA/B, MemA/B/C) that transform data on-chip through streaming interfaces. Moreover, the control of the DDR FU is more complex than that of the LPDDR FU because we enable complex interleaving of load/store feature maps in DDR, while LPDDR is only used to load read-only weights and biases. For the compression ratio of the RSN instruction to uOPs, the LPDDR FU and DDR FU have relatively lower ratios (\(2 \sim 4.2\)x) compared to other FUs (\(6.8 \sim 22.7\)x) because it is more difficult to exploit common patterns when addressing off-chip memories than when accessing stream interfaces.

Deadlock: A decoder is back-pressured if its downstream FIFO is full. The fetch unit issues instructions continuously until it encounters a stall. For instance, when FU1 is executing and waiting for FU2 to consume data, the fetch unit can stall because FU1 is unable to accept new uOPs. It remains stalled until FU1 completes its current kernels and consumes the next uOP. In such cases, a deadlock may occur if the fetch unit stalls before fetching the instruction that directs FU2 to consume the data from FU1. To address this, increasing the FIFO depth in the decoding system helps the fetch unit continue to fetch subsequent instructions. While comprehensive deadlock prevention is more complex and beyond the scope of this paper, we report that setting FIFO depths to six between the uOP and mOP decoders is deadlock-free in our implementation.

Figure 8: RSN-XNN Datapath and Example Application

4 RSN-XNN: An RSN Case Study

4.1 Datapath and Instruction Set Overview

We implement a proof-of-concept RSN hardware for transformer encoders, called RSN-XNN. Fig. 8 illustrates its datapath. The AIE side has six MME (matrix multiplication engine) FUs that receive streaming inputs for the left-hand side (LHS) operands from MeshA FUs and for the right-hand side (RHS) operands from MeshB FUs, and send the results to MemC FUs on the PL side. Additionally, the LPDDR FU loads weights and biases from off-chip LPDDR, while the DDR FU manages the loading and storing of feature maps from off-chip DDR. The decoder unit fetches the RSN instruction sequence and issues uOP sequences to the FUs on the PL side. Since AIE tiles are processors with their own instruction memory, the uOPs for MME FUs are pre-stored locally and are not interleaved into the main single instruction sequence.

Table 2 lists the control planes for different types of FUs in RSN-XNN. MME FUs mainly perform matrix multiplications in a tiled manner and also support several non-MM operations, including adding bias, adding the output of a previous layer to the current layer, and applying scale and shift in LayerNorm. DDR/LPDDR FUs route data between off-chip and on-chip memory, and MeshA/B FUs route data between the AIE and on-chip memory. Fine-grained load/store interleaving can be achieved by orchestrating the uOP sequence for DDR. MME FUs can be flexibly grouped by modifying data routing in MeshA/B FUs, enabling the use of all six MME FUs for a single matrix multiplication or the pipelined execution of multiple matrix multiplications. MemA/B/C FUs serve as flexible and fast scratchpads to increase off-chip data reuse, and they are double buffered to allow the overlapping of computation and data movement. They also implement some non-MM operations, such as softmax, GELU, transpose, and the mean, variance, and normalization operations in LayerNorm.

Fig. 8 also presents an example application performing a 1x8x4 tile matrix multiplication, which triggers 2 MME, 1 MemA, 2 MemB, and 2 MemC FUs. MemA0 first expects the arrival of the first tile of LHS data from DDR FU. Then, it sends the previous tile to MeshA and receives the new tile from DDR, repeating this process 15 times. Finally, it sends the last tile. MemB0/1 and MemC0/1 FUs are controlled similarly, using three instructions to manage the prolog, steady state, and epilog phases, respectively. MeshA/B FUs fan in and fan out data between MemA/B and MME FUs. MeshA reads LHS data from MemA0 and copies it to both MME0/1, while MeshB passes RHS data from MemB0 to MME0 and from MemB1 to MME1. Their actions are only set once because the dataflow remains the same. LPDDR FU loads 16 tiles of RHS data and alternates sending them to MemB0/1 FUs. The DDR FU operates the same as Way 1 in Fig. 11, storing 2 OUT tiles per 8 LHS input tiles loaded.
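A behavioral sketch of MemA0's three uOPs in this example is shown below (plain C++ with std::queue standing in for the latency-insensitive streams; the tile size and function names are illustrative). In the real FU the two tile buffers are ping-ponged, so the send and receive in the steady state overlap rather than alternate.

```cpp
#include <array>
#include <queue>

using Tile = std::array<float, 4096>;                 // illustrative tile payload

// MemA0 in the Fig. 8 example: 1 prolog receive, 15 steady-state
// send-previous/receive-next iterations, 1 epilog send (16 tiles total).
void memA0(std::queue<Tile>& from_ddr, std::queue<Tile>& to_meshA) {
  // uOP 1 (prolog): receive the first LHS tile from the DDR FU.
  Tile cur = from_ddr.front(); from_ddr.pop();
  // uOP 2 (steady state): forward the previous tile to MeshA while taking
  // in the next tile from DDR, repeated 15 times.
  for (int i = 0; i < 15; ++i) {
    to_meshA.push(cur);
    cur = from_ddr.front(); from_ddr.pop();
  }
  // uOP 3 (epilog): send the last tile.
  to_meshA.push(cur);
}
```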

Table 2: Control Planes for Different Types of FUs in RSN-XNN
FU Control Plane
MME matrix size, tile size, add bias (flag), add previous layer (flag), calculate scale and shift (flag), accumulate along k (flag).
DDR addr, stride size, stride offset, stride count, load (flag), destFU, store (flag), srcFU.
LPDDR addr, stride size, stride offset, stride count, destFU, load bias (flag).
MeshA/B size, srcFUs, destFUs.
MemA matrix size, tile size, srcFU, load data (flag), send to MME (flag).
MemB matrix size, tile size, load data (flag), send to MME (flag), transpose input (flag), load bias (flag).
MemC matrix size from MME, matrix size to DDR, tile size from MME, tile size to DDR, receive from MME (flag), send to MME (flag), softmax (flag), gelu (flag), mean/variance/normalization (flag).

4.2 Decision Process of Datapath Generation

The datapath generation process includes three main stages:

Model segmentation: We begin with a first-order formula-based calculation to segment targeted models so that resources could be mapped efficiently. Compute-bound layers are segmented individually, whereas multiple memory-bound layers are grouped together and executed in a pipelined manner to reduce off-chip data accesses, as detailed in Section 4.3. Moreover, multiple layers can also be grouped to overlap the prolog and epilog phases of layers, as detailed in Section 4.4.

Single segment analysis: This stage involves analyzing data movement and transformations segment by segment. For each segment, we decide on the on-chip buffer size for each operand to ensure sufficient reuse of off-chip data, computation resource allocation across different layers, datapath to fuse matrix multiplication and non-MM operations, and data layout transformations between PL/off-chip and PL/AIE.

Collective datapath construction: This stage reviews all segments collectively to determine a “union” datapath that encompasses all segment requirements while minimizing unnecessary edges and FUs. We initially assume a fixed-function style for the entire datapath, aiming to split the datapath and create new FUs only when divergent datapath reuse patterns are necessary. For example, Mesh FUs are not created for outputs returning from AIE to PL, as each MME consistently communicates with the same MemC. Also, although each MemA FU has local memories that can process 256 floats in parallel, we opt not to further partition it because no segment demands more fine-grained data movement patterns. Furthermore, we expose only the necessary controls for datapath reuse to the ISA. For instance, we do not expose data layout transformations from the PL side to the AIE side because the layout transformations remain the same across all segments.

Figure 9: Four Mapping Types and Their Disadvantages

Table 3: Latency Estimation for Four Mapping Types
Latency if inf. FLOPS Used AIE Latency per MM if inf. BW Latency if inf. BW Final Latency
A MM1 2.22 ms 64% 0.81 ms 2.43 ms 2.43 ms
MM2 1.62 ms
B MM1 10.9 ms 64% 0.81 ms 2.43 ms 10.9 ms
MM2 1.62 ms
C MM1 10.9 ms 96% 0.54 ms 1.08 ms 10.9 ms
MM2 0.54 ms
D MM1 2.24 ms 96% 0.81 ms 1.62 ms 2.24 ms
MM2 1.62 ms
Attention layer in BERT-Large, B=6, SeqLen=512. Softmax is ignored.
MM1: Key\(\times\)Query, 512\(\times\)​64\(\times\)​512, Number=96.
MM2: MM1’s Output\(\times\)Value, 512\(\times\)​512\(\times\)​64, Number=96.

Figure 10: Pipeline Non-MMs and their Adjacent MMs

Figure 11: Three Ways to Map Load and Store Operations to One DDR Channel

4.3 Compute Resources Mapping

DNN inference has massive parallelism at varying granularities, ranging from data, tiles, and layers to batches and models. Figure 9 illustrates four mapping types for two parallel two-stage tasks. Task-by-task mapping completes one task before starting the next and can underutilize resources if tasks are too small. Stage-by-stage mapping completes each stage of all tasks first before moving to the next stage, which may require off-chip memory storage and result in high off-chip accesses. Task-parallel mapping divides resources spatially to execute multiple tasks in parallel, improving resource utilization but requiring separate on-chip buffers for each task’s input and output. Pipeline mapping pipelines stage execution to enhance resource utilization and reduce off-chip accesses, but it incurs longer latency from pipeline setup.

Table 3 compares the execution of two stages of small matrix multiplications (MMs) in BERT-Large’s attention layers using different mapping types on VCK190 hardware budgets, estimated using roofline model formulas. Types B and C have high latency due to large off-chip feature map accesses between the two MM stages. Type A has low AIE utilization because the MMs are too small. Type D achieves the lowest latency with removed off-chip feature map accesses and high AIE utilization, while the extra latency caused by pipeline setup is negligible. While this table shows the benefits of using the pipeline mapping type for attention layers, larger layers, such as the feedforward layers with large MMs, require different mapping styles. Storing intermediate feature maps between two MMs in BERT-Large’s feedforward layers requires over 25 MB of buffers, exceeding our on-chip capacity. However, these MMs are large enough to reach high AIE utilization when executing one MM at a time. As static mapping does not suit all cases, it is critical to make hardware capable of dynamically switching between mapping types at runtime.
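The estimates in Table 3 come from first-order roofline formulas; a minimal sketch of that style of calculation is shown below, using the VCK190 peak numbers and the simplifying assumption that every operand and result crosses off-chip exactly once. The printed value is illustrative and is not intended to reproduce a specific table entry.

```cpp
#include <algorithm>
#include <cstdio>

// First-order roofline estimate: latency is bounded below by the compute
// time at peak FLOPS and by the transfer time at peak off-chip bandwidth.
double roofline_latency_ms(double flops, double bytes_offchip,
                           double peak_tflops, double peak_gbps) {
  double t_compute = flops / (peak_tflops * 1e12);        // seconds
  double t_memory  = bytes_offchip / (peak_gbps * 1e9);   // seconds
  return 1e3 * std::max(t_compute, t_memory);
}

int main() {
  // Attention MM1 of BERT-Large (512x64x512 per head, 96 heads, FP32),
  // assuming 8 TFLOP/s and 57.6 GB/s and no on-chip reuse of the operands.
  double flops = 2.0 * 512 * 64 * 512 * 96;
  double bytes = 4.0 * (512.0 * 64 + 64.0 * 512 + 512.0 * 512) * 96;
  std::printf("%.2f ms\n", roofline_latency_ms(flops, bytes, 8.0, 57.6));
  return 0;
}
```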

Non-MM operators are fused with their adjacent MM operations using a pipeline mapping style to avoid off-chip memory accesses and hide the latency of executing non-MMs within the time taken for MMs. As shown in Fig. 10, Softmax in attention layers occurs between two PL modules: 1) receive the 1st MM results from the AIE and 2) send the 2nd MM inputs to the AIE. A ping-pong buffer mechanism is used to allow overlapping between RCEV and SEND operations, enabling pipeline execution between the two MMs. We insert Softmax after RCEV and before the ping-pong buffers flip. This is because RCEV is shorter than SEND (8.4 vs. 16.8 µs for BERT-Large) in our mapping strategy, and scheduling Softmax together with RCEV utilizes the idle time during which RCEV waits for SEND to complete. The final throughput for processing the attention layers is determined by the maximum of the latency taken by RCEV plus Softmax, and the latency taken by SEND. Using a similar analysis, we insert the GELU operation right after the ping-pong buffers flip and before the start of storing the final result back to off-chip memory.
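Stated as a formula, the steady-state rate of the fused attention pipeline is

\[
\text{Throughput}_{\text{attn}} \;=\; \frac{1}{\max\!\left(t_{\text{RCEV}} + t_{\text{Softmax}},\; t_{\text{SEND}}\right)},
\]

so Softmax stays fully hidden as long as \(t_{\text{Softmax}} \le t_{\text{SEND}} - t_{\text{RCEV}} = 16.8 - 8.4 = 8.4\) µs per iteration.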

4.4 Bandwidth Resources Mapping

In RSN-XNN, instructions can explicitly specify fine-grained off-chip load and store interleaving to maximize the utilization of off-chip bandwidth. Fig. 11 shows an MM example where the input tiles are loaded from one DDR channel along the K dimension 8 times, followed by sending the completed output tile off-chip. If the load-compute-store order is strictly followed, the computation for the second output tile will stall when the first one is stored back to DDR. One way to improve this is to schedule the loading of 8 input tiles for the second output simultaneously with the storing of the first output tile. This can be achieved by pushing requests to the AXI read/write channels and letting the hardware controller arbitrate the execution. However, since the hardware controller does not have application-level information, it schedules loads and stores non-deterministically and cannot guarantee the optimal ordering. Instead, we explicitly specify the DDR load and store ordering using instructions, carefully interleaving the load/store operations to maximize bandwidth utilization. In addition to controlling execution within a single layer, we also support overlapping the prolog and epilog phases of layers by interleaving the storing of the last output tile from the previous layer with the loading of the first input tile for the current layer. This is particularly useful in scenarios involving many small layers, where the prolog and epilog phases are significant due to the small steady phases, such as in attention layers. Given that DNN inference is generally deterministic to allow for predictable memory utilization, it is beneficial to provide the flexibility for precise control of off-chip memory accesses, enabling optimal scheduling based on specific workloads and hardware capabilities.
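A sketch of how such an ordering can be expressed as a DDR FU uOP sequence is shown below (types, field names, and the interleaving position are hypothetical; the real instructions also carry addresses and strides). The point is only that the program, not the AXI arbiter, fixes the load/store order.

```cpp
#include <vector>

// Hypothetical DDR FU uOPs in Fig. 11 style: each output tile needs k_tiles
// input-tile loads, and the store of output tile i is slotted in among the
// loads for output tile i+1 instead of serializing load -> compute -> store.
enum PeerFU { MEM_A, MEM_C };
struct DdrUOP { bool is_store; int tile_idx; PeerFU peer; };

std::vector<DdrUOP> interleaved_schedule(int num_out_tiles, int k_tiles) {
  std::vector<DdrUOP> prog;
  for (int o = 0; o < num_out_tiles; ++o) {
    for (int k = 0; k < k_tiles; ++k) {
      prog.push_back({false, o * k_tiles + k, MEM_A});   // load input tile -> MemA
      if (o > 0 && k == 0)
        prog.push_back({true, o - 1, MEM_C});            // store previous output tile
    }
  }
  prog.push_back({true, num_out_tiles - 1, MEM_C});      // epilog: store last output
  return prog;
}
```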

4.5 Compatibility of High-level Program

Figure 12: Domain Specific Library Usage Example

A domain-specific library, RSNlib, has been developed to generate RSN instructions from high-level Python code. Fig. 12 presents an example code that specifies the transformer encoder model. The library processes PyTorch models composed of RSNlib operators according to a predefined execution schedule. It employs a template-based approach to validate whether the model and schedule align with supported backend patterns. It shows that the proposed overlay software can be compatible with existing compiler infrastructures through library-based methods. Exploring the automatic generation of the datapath from arbitrary input code is beyond the scope of this paper but could be a topic for future research.

5 Evaluation

Experiment setup: We implement RSN-XNN on the VCK190 Evaluation Kit [37] and program the PL side using HLS with the Vitis 2023.2 software [51]. We measure the execution latency on the CPU host using the std::chrono clock. All experiments use the same bitstream, varying the instructions passed to the datapath to accommodate different applications. We source inputs and weights for BERT-Large from Hugging Face [52], load them onto the board, and validate the outputs against reference results. We obtain latency for the T4, V100, and A100 GPUs from NVIDIA’s state-of-the-art reports [53] and measure L4 GPU latency on Google Colab. We measure the GPU power using nvidia-smi and the VCK190 power using Xilinx’s BEAM [54].

Precision: Experiments use FP32 precision. Although FP16 is preferred, the VCK190 supports only FP32, INT16, and INT8. INT16 is uncommon, and INT8 causes significant accuracy drops in BERT-Large, as Intel studied [55].

Figure 13: Device View of the Implemented Design

Total area: The FPGA runs at 260 MHz, and the AIE at 1.25 GHz. Fig. 13 shows the device view of the routed design. The final FPGA resource utilization includes 598,144 registers (33%), 494,855 LUTs (55%), 1,073 DSP blocks (55%), 967 BRAMs (59%), and 463 URAMs (41%).

Table 4: Overlay Area Overhead And Utilization Comparison
Design Device LUT FF DSP BRAM
RSN-XNN VCK190 11.7k(3%) 8.6k(2.5%) 5(0.5%) 4(0.3%)
DFX U280 3k(0.6%) 13k(1.2%) 0 24(2%)
DLA Arria10 Uses 2046 ALMs (7% of total ALMs on board). Total design area is unreported.
Design Precis. Peak Perf. (TFLOPS) Off-chip BW (GB/s) Achieved Perf. (TFLOPS) Util.(%)
RSN-XNN FP32 8 57.6 4.7 59
DFX FP16 1.2 460 0.19 16
DLA does not report achieved performance in FLOPS. Achieved performance is measured under similar transformer workloads: RSN-XNN for BERT-Large and DFX for GPT-2 prefill stages.

Area overhead: Table 4 presents the area overhead of the instruction decoder units across different designs. The data shows that RSN-XNN maintains a low overhead, both in absolute terms and as a percentage of the total design area. The table also includes data for two overlays: DLA [12], which controls execution at the tile level, and DFX [17], which controls execution at the layer level. RSN-XNN’s area overhead is comparable to existing overlays but offers greater flexibility and better resource utilization. As the instruction processing rate is very low (for the workload in Fig. 7, RSN-XNN has an average RSN instruction processing rate of only 1.4 MB/s and an average compute-to-instruction ratio of up to 3.15 GFLOPS/Byte), area can be saved by slowing down the decoder unit. We employ several low-cost techniques, such as increasing the loop initiation interval and allowing multiple cycles to decode an instruction.

5.1 GEMM Performance

Table 5: Matrix Multiplication Throughput Comparison
Method AIE TileSize (MxKxN) Used AIE Throughput (GFLOPS) Gain
CHARM [6] 32x32x32 384 (96%) 4504.46 + 0%
MaxEVA [56] 32x32x32 390 (98%) 5442.11 + 20.8%
AMA [57] 32x32x32 342 (86%) 5867.29 + 30.3%
RSN-XNN 32x16x32 384 (96%) 6095.64 + 35.3%
RSN-XNN 32x32x16 384 (96%) 6306.02 + 40.0%
RSN-XNN 32x32x32 384 (96%) 6784.96 + 50.6%
Square MM Size CHARM [6] RSN-XNN Gain
1024 1103.46 2982.62 + 170.3%
3072 2850.13 6600.12 + 131.6%
6144 3277.99 6750.93 + 105.9%
CHARM only uses DDR memory. MaxEVA and AMA do not explore end-to-end throughput with DRAM.

Figure 14: Reuse of AIE To/From PL Streams

Table 5 shows the performance of a single GEMM kernel with data either generated from the PL side or transferred from DRAM. Our AIE programming achieves a gain of 0.92 TFLOPS compared to the state-of-the-art AMA [57]. Fig. 14 shows our optimization strategy. Each AIE tile uses two input streams and one output stream. With 400 AIE tiles in the VCK190, using all tiles would require 800 input streams and 400 output streams, but the VCK190 allows only 234/156 64-bit input/output streams between the AIE and PL. To address this, we group AIE tiles to share input streams. Output streams are reduced by chaining tiles together, cascading data to the next tile instead of sending it to the PL. We create 6 groups, each corresponding to one MME FU and containing 64 tiles in a 4x4x4 format, reusing the LHS/RHS/output streams 4 times. This setup utilizes 384 AIE tiles, 192 input streams, and 96 output streams, all within the resource budgets.

Although off-chip memories theoretically offer 25.6 GB/s for DDR and 32 GB/s for LPDDR, the peak observed bandwidths are 21 GB/s (DDR reads), 23.5 GB/s (DDR writes), and 20.5 GB/s (LPDDR reads). To achieve the peak performance of 6.78 TFLOPS, each loaded weight must be reused over 661 times. DDR is used for both feature map loading and storing, requiring even larger reuse. We implement an output-stationary MM tiling scheme on the PL side, with the LHS tile size set to 768x128, the RHS to 128x1024, and the output to 1024x1024, allowing for complete accumulation along the K dimension before storing off-chip. This setup enables 768x reuse of the RHS, 1024x reuse of the LHS, and efficient output accumulation. We optimize further by finely interleaving load and store operations using RSN instructions. To reduce strided off-chip memory accesses, data is stored in a 128x64 blocked layout off-chip, and MemA/B/C handle on-chip conversion from blocked to row-major or transposed format. The lower part of Table 5 shows the effectiveness of our solution compared to [6].
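The 661x figure is consistent with a simple bandwidth argument, assuming FP32 weights are streamed from LPDDR at the observed 20.5 GB/s and each fetched weight contributes one multiply-accumulate (2 FLOPs) per use:

\[
\text{required reuse} \;\approx\; \frac{6.78\times 10^{12}\ \text{FLOP/s}}{2\ \tfrac{\text{FLOP}}{\text{use}} \times \frac{20.5\times 10^{9}\ \text{B/s}}{4\ \text{B/weight}}} \;\approx\; 661\ \text{uses per weight}.
\]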

Figure 15: Achieved Latency/Throughput VS CHARM [6]

Table 6: Comparison of Latency per Task at Maximum Throughput
Model BERT VIT NCF MLP
CHARM 57.2 ms 57.7 ms 40.4 ms 119 ms
RSN-XNN 17.98 ms 23.7 ms 16.1 ms 42.6 ms

5.2 Comparison to SOTA FPGA Design

BERT-Large 1st Encoder, Sequence Length = 512, Batch = 6, FP32.

Table 7: Execution Details of Different Model Segments
MMs Size (M\(\times\)K\(\times\)N\(\times\)Num) Combined non-MMs No Optimize (ms) BW Optimized (ms) Multi MMs together (ms) Final (ms)
Key 3072\(\times\)1024\(\times\)1024\(\times\)1 Bias 1.667 (1x) 1.276 (1.31x) 3.584 (1.4x) Overlap prolog/epilog 17.98 (2.47x)
Query 3072\(\times\)1024\(\times\)1024\(\times\)1 Bias 1.667 (1x) 1.276 (1.31x)
Value 3072\(\times\)1024\(\times\)1024\(\times\)1 Bias 1.667 (1x) 1.276 (1.31x)
Attention MM1 512\(\times\)64\(\times\)512\(\times\)96 Transpose, Softmax 10.55 (1x) 2.618 (8.52x) Pipeline MMs + overlap
Attention MM2 512\(\times\)512\(\times\)64\(\times\)96 11.75 (1x)
Dense 3072\(\times\)1024\(\times\)1024\(\times\)1 LayerAdd, Scale & Shift, Bias, Mean & Var, Norm 2.913 (1x) 2.035 (1.43x) 11.88 (1.45x) Overlap prolog/epilog
Feedforward MM1 3072\(\times\)1024\(\times\)4096\(\times\)1 Bias, GELU 8.492 (1x) 5.501 (1.55x)
Feedforward MM2 3072\(\times\)4096\(\times\)1024\(\times\)1 LayerAdd, Scale & Shift, Bias, Mean & Var, Norm 5.764 (1x) 4.811 (1.20x)

Fig. 15 compares the latency and throughput of RSN-XNN against the state of the art [6] at varying batch sizes for the first encoder of BERT-Large. Our best latency is 5 ms at batch size (B)=1, which is 22x faster than their best latency of 110 ms at B=6. At the same batch size, we achieve speedups ranging from 6.1x (B=6) to 3.3x (B=24). Regarding throughput, our performance nearly saturates at B=3 (97% of the peak) and reaches a peak of 333.76 tasks/sec at B=6, which is 3.25x better than CHARM’s best throughput at B=24. Their approach designs two separate MM engines for small and large layers, requiring them to schedule at a 6-batch granularity and interleave the execution of four 6-batches to fully overlap small and large MME executions. In contrast, RSN-XNN dynamically switches between pipeline and non-pipeline execution to efficiently handle small and large layers, allowing for single-batch execution.

Table 6 compares the latency per task at maximum throughput for four applications, BERT, VIT, NCF, and MLP, against CHARM. All task size configurations align with CHARM’s implementations. RSN-XNN achieves throughput improvements of 3.2x, 2.4x, 2.5x, and 2.8x for BERT, VIT, NCF, and MLP, respectively. Moreover, CHARM necessitates redesigning the datapath for different applications, whereas our design uses the same datapath for all applications.

5.3 Segment-wise Latency Breakdown Analysis

Table 7 provides a latency breakdown of different model segments and the effects of different optimization techniques. With fine-grained load and store interleaving, the first three large MMs achieve a 1.31x latency speedup, while the last three large MMs achieve speedups of 1.43x, 1.55x, and 1.20x, respectively. Small MMs in the attention layers struggle with sequential execution due to large off-chip accesses for storing/loading feature maps, low MME utilization, and a short steady state for overlapping memory access and computation. We observe an 8.52x speedup by executing attention MM1 and MM2 in a pipelined manner and overlapping the prolog/epilog across parallel attention heads. Compared to a typical overlay style that executes layers sequentially without fine-grained bandwidth mapping, RSN-XNN achieves a 2.47x speedup.

5.4 Comparison to GPUs

Table 8 compares the performance and energy efficiency of RSN-XNN with the NVIDIA T4 [58], V100 [59], A100 [26], and L4 GPUs [60]. On the VCK190, we execute the encoder layer 24 times to obtain latency, with the embedding layers considered negligible (less than 0.2 ms on the T4). Against the T4 GPU with the same 8 TFLOPS FP32 performance, RSN-XNN achieves slightly better latencies at B=2, B=4, and B=8, despite the VCK190 having only 58 GB/s of off-chip memory bandwidth compared to the T4’s 320 GB/s. When B=1, our latency is worse than the T4’s (0.7x its speed) because the weights can only be reused up to 384 times, half of the 661 times needed for peak performance, due to our low off-chip memory bandwidth. Against the A100 GPU under the same 7nm process node, RSN-XNN achieves 2.1x/4.5x better operating/dynamic energy efficiency in FP32 but higher latency because of the A100’s superior peak performance and bandwidth. Compared to the modern energy-efficient L4 GPU, RSN-XNN runs at 0.69x the speed but is slightly more energy efficient in FP32. All GPUs should reach saturation in FP32 at B=8, as indicated by the latency trend with increasing batch size. These statistics offer quantitative insights into the performance and energy differences between an efficient Versal design and GPUs for a large MM workload. We also offer a data point using the A100 in FP16, showing much better latency and twice the energy efficiency, as its FP16 peak performance is 39 times greater than the VCK190’s peak FP32 performance. This underscores the need for FPGAs to continue integrating ASIC efficiency for enhanced performance and bandwidth.

BERT-Large, Sequence Length=384, FP32.

Table 8: Comparison of T4, V100, A100, L4, and VCK190
T4 V100 A100 A100 L4 VCK190
Precision FP32 FP32 FP32 FP16 FP32 FP32
Release Date 2018 2017 2020 2023 2021
Process (nm) 12 12 7 5 7
Peak Perf. (TFLOPS) 8.1 15.7 19.5 312 30.3 8.0
Off-chip BW (GB/s) 320 900 1555 300 57.6
Die Area (mm2) 545 815 826 294 \(\leq\) 458 [61]
Latency (ms) by Batch Size
B = 8 499 182 137 23 307 444
B = 4 258 93 72 15 156 220
B = 2 127 49 40 10 83 122
B = 1 67 29 23 8 41 95
Energy Efficiency (Batch Size = 8)
Operating Power (W) 72 292 308 392 72 45.5
Dynamic Power (W) 42 256 268 352 41 18.2
Opt. Efficiency (Seq/J) 0.22 0.15 0.19 0.89 0.36 0.40
Dy. Efficiency (Seq/J) 0.38 0.17 0.22 0.99 0.64 0.99

5.5 Sensitivity to Off-chip Memory Bandwidth

Table 9 shows the effect of varying bandwidth on the latency of the BERT-Large model. We simulate different bandwidths by adjusting the amount of data moved from/to off-chip; for example, we halve the data transferred from off-chip and pad the remainder on-chip to simulate 2X BW. The first two data columns represent the theoretical minimum achievable latency if the bandwidth were infinite with no compute setup overheads, and if the computational resources were infinite, respectively. Increasing bandwidth does not yield significant benefits, suggesting that the current use of bandwidth is already highly efficient. It also shows that the current execution achieves 78.6% utilization of the peak bandwidth.

Table 9: BERT-Large Bandwidth Sweep Analysis, SeqLen=384, B=8

Scenario       Infinite BW & no setup   Infinite compute   0.5X BW   1X BW   2X BW   3X BW
Latency (ms)   311                      349                704       444     387     372
Speedup        1.43                     1.27               0.63      1.00    1.15    1.19
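The 78.6% figure can be read straight off Table 9: with infinite compute, latency collapses to pure data-transfer time, so the ratio of that bound to the measured 1X latency is the effective bandwidth utilization. A small sketch of this arithmetic (values copied from Table 9; the variable names are ours):

```python
# Effective bandwidth utilization derived from the sweep in Table 9.
# With infinite compute, the latency is purely data movement, so
# (infinite-compute latency) / (1X-BW latency) is the fraction of time the
# 1X run keeps the memory interface busy.

latency_infinite_compute_ms = 349  # Table 9, "Infinite compute"
latency_1x_bw_ms = 444             # Table 9, "1X BW"

utilization = latency_infinite_compute_ms / latency_1x_bw_ms
print(f"effective bandwidth utilization: {utilization:.1%}")  # ~78.6%

# The kX-BW points are emulated by moving only 1/k of the real traffic
# off-chip and padding the rest on-chip; speedups are relative to 1X BW.
for k, measured_ms in [(0.5, 704), (1, 444), (2, 387), (3, 372)]:
    print(f"{k}X BW: {measured_ms} ms, "
          f"speedup {latency_1x_bw_ms / measured_ms:.2f}x")
```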

6 Conclusion↩︎

This paper proposes and builds the Reconfigurable Stream Network architecture, a new paradigm for creating FPGA overlays that elegantly balances specialization and generality.

References↩︎

[1]
N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” SIGARCH Comput. Archit. News, vol. 45, p. 1–12, jun 2017.
[2]
D. Abts, G. Kimmell, A. Ling, J. Kim, M. Boyd, A. Bitar, S. Parmar, I. Ahmed, R. DiCecco, D. Han, J. Thompson, M. Bye, J. Hwang, J. Fowers, P. Lillian, A. Murthy, E. Mehtabuddin, C. Tekur, T. Sohmers, K. Kang, S. Maresh, and J. Ross, “A software-defined tensor streaming multiprocessor for large-scale machine learning,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, ISCA ’22, (New York, NY, USA), p. 567–580, Association for Computing Machinery, 2022.
[3]
X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, “DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8, 2018.
[4]
M. Hall and V. Betz, “HPIPE: Heterogeneous layer-pipelined and sparse-aware CNN inference for FPGAs,” in Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’20, (New York, NY, USA), p. 320, Association for Computing Machinery, 2020.
[5]
H. Chen, J. Zhang, Y. Du, S. Xiang, Z. Yue, N. Zhang, Y. Cai, and Z. Zhang, “Understanding the potential of FPGA-based spatial acceleration for large language model inference,” ACM Transactions on Reconfigurable Technology and Systems, 04 2024.
[6]
J. Zhuang, J. Lau, H. Ye, Z. Yang, Y. Du, J. Lo, K. Denolf, S. Neuendorffer, A. Jones, J. Hu, D. Chen, J. Cong, and P. Zhou, “CHARM: Composing heterogeneous accelerators for matrix multiply on Versal ACAP architecture,” in Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’23, (New York, NY, USA), p. 153–164, Association for Computing Machinery, 2023.
[7]
X. Zhang, H. Ye, J. Wang, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, “DNNExplorer: a framework for modeling and exploring a novel paradigm of FPGA-based DNN accelerator,” in Proceedings of the 39th International Conference on Computer-Aided Design, ICCAD ’20, (New York, NY, USA), Association for Computing Machinery, 2020.
[8]
X. Wei, Y. Liang, X. Li, C. H. Yu, P. Zhang, and J. Cong, “TGPA: Tile-grained pipeline architecture for low latency CNN inference,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8, 2018.
[9]
J. Zhuang, Z. Yang, S. Ji, H. Huang, A. K. Jones, J. Hu, Y. Shi, and P. Zhou, “SSR: Spatial sequential hybrid architecture for latency throughput tradeoff in transformer acceleration,” in Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’24, ACM, Apr. 2024.
[10]
X. Cai, Y. Wang, X. Ma, Y. Han, and L. Zhang, “DeepBurning-SEG: Generating DNN accelerators of segment-grained pipeline architecture,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1396–1413, 2022.
[11]
Y. Shen, M. Ferdman, and P. Milder, “Maximizing CNN accelerator efficiency through resource partitioning,” SIGARCH Comput. Archit. News, vol. 45, no. 2, p. 535–547, 2017.
[12]
M. S. Abdelfattah, D. Han, A. Bitar, R. DiCecco, S. O’Connell, N. Shanker, J. Chu, I. Prins, J. Fender, A. C. Ling, and G. R. Chiu, “DLA: Compiler and FPGA overlay for neural network inference acceleration,” in 2018 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 411–4117, 2018.
[13]
Y. Yu, C. Wu, T. Zhao, K. Wang, and L. He, “OPU: An FPGA-based overlay processor for convolutional neural networks,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 1, pp. 35–47, 2020.
[14]
J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil, P. Patel, A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt, A. M. Caulfield, E. S. Chung, and D. Burger, “A configurable cloud-scale DNN processor for real-time AI,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 1–14, 2018.
[15]
A. Boutros, E. Nurvitadhi, R. Ma, S. Gribok, Z. Zhao, J. C. Hoe, V. Betz, and M. Langhammer, “Beyond peak performance: Comparing the real performance of AI-optimized FPGAs and GPUs,” in 2020 International Conference on Field-Programmable Technology (ICFPT), pp. 10–19, 2020.
[16]
H. Khan, A. Khan, Z. Khan, L. B. Huang, K. Wang, and L. He, “NPE: An FPGA-based overlay processor for natural language processing,” in The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’21, (New York, NY, USA), p. 227, Association for Computing Machinery, 2021.
[17]
S. Hong, S. Moon, J. Kim, S. Lee, M. Kim, D. Lee, and J.-Y. Kim, “DFX: A low-latency multi-FPGA appliance for accelerating transformer-based text generation,” in 2022 IEEE Hot Chips 34 Symposium (HCS), pp. 1–17, 2022.
[18]
S. Zeng, J. Liu, G. Dai, X. Yang, T. Fu, H. Wang, W. Ma, H. Sun, S. Li, Z. Huang, Y. Dai, J. Li, Z. Wang, R. Zhang, K. Wen, X. Ning, and Y. Wang, “FlightLLM: Efficient large language model inference with a complete mapping flow on FPGAs,” in Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’24, (New York, NY, USA), p. 223–234, Association for Computing Machinery, 2024.
[19]
T. Moreau, T. Chen, L. Vega, J. Roesch, E. Yan, L. Zheng, J. Fromm, Z. Jiang, L. Ceze, C. Guestrin, and A. Krishnamurthy, “A hardware–software blueprint for flexible deep learning specialization,” IEEE Micro, vol. 39, no. 5, pp. 8–16, 2019.
[20]
S. Hur, S. Na, D. Kwon, J. Kim, A. Boutros, E. Nurvitadhi, and J. Kim, “A fast and flexible FPGA-based accelerator for natural language processing neural networks,” ACM Trans. Archit. Code Optim., vol. 20, feb 2023.
[21]
H. Ye, X. Zhang, Z. Huang, G. Chen, and D. Chen, “HybridDNN: a framework for high-performance hybrid DNN accelerator design and implementation,” in Proceedings of the 57th ACM/EDAC/IEEE Design Automation Conference, DAC ’20, IEEE Press, 2020.
[22]
B. Zhang, H. Zeng, and V. K. Prasanna, “GraphAGILE: An FPGA-based overlay accelerator for low-latency GNN inference,” IEEE Transactions on Parallel & Distributed Systems, vol. 34, no. 09, pp. 2580–2597, 2023.
[23]
Z. Guo, K. Liu, W. Liu, X. Sun, C. Ding, and S. Li, “An overlay accelerator of DeepLab CNN for spacecraft image segmentation on FPGA,” Remote Sensing, vol. 16, p. 894, 03 2024.
[24]
AMD, “DPU IP details and system integration,” 2023. Available: https://xilinx.github.io/Vitis-AI/3.5/html/docs/workflow-system-integration, Accessed: 5-November-2024.
[25]
Intel, “Intel deep learning boost,” 2019.
[26]
NVIDIA, “Nvidia A100 tensor core GPU,” 2021. Available: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf, Accessed: 21-November-2024.
[27]
A. Boutros, A. Arora, and V. Betz, “Field-programmable gate array architecture for deep learning: Survey & future directions,” 2024.
[28]
S. Ahmad, S. Subramanian, V. Boppana, S. Lakka, F.-H. Ho, T. Knopp, J. Noguera, G. Singh, and R. Wittig, “Xilinx first 7nm device: Versal AI core (VC1902),” in 2019 IEEE Hot Chips 31 Symposium (HCS), pp. 1–28, 2019.
[29]
M. Langhammer, E. Nurvitadhi, B. Pasca, and S. Gribok, “Stratix 10 NX architecture and applications,” in The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’21, (New York, NY, USA), p. 57–67, Association for Computing Machinery, 2021.
[30]
M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, “TANGRAM: Optimized coarse-grained dataflow for scalable NN accelerators,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’19, (New York, NY, USA), p. 807–820, Association for Computing Machinery, 2019.
[31]
H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, and A. Parashar, “MAESTRO: A data-centric approach to understand reuse, performance, and hardware cost of DNN mappings,” IEEE Micro, vol. 40, no. 3, pp. 20–29, 2020.
[32]
J. Cai, Y. Wei, Z. Wu, S. Peng, and K. Ma, “Inter-layer scheduling space definition and exploration for tiled accelerators,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, ISCA ’23, (New York, NY, USA), Association for Computing Machinery, 2023.
[33]
S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey, and A. Raghunathan, “SCALEDEEP: A scalable compute architecture for learning and evaluating deep networks,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 13–26, 2017.
[34]
S. Zheng, X. Zhang, L. Liu, S. Wei, and S. Yin, “Atomic dataflow based graph-level workload orchestration for scalable DNN accelerators,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 475–489, 2022.
[35]
Y. Yang, J. S. Emer, and D. Sanchez, “ISOSceles: Accelerating sparse CNNs through inter-layer pipelining,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 598–610, 2023.
[36]
J. Cai, Z. Wu, S. Peng, Y. Wei, Z. Tan, G. Shi, M. Gao, and K. Ma, “Gemini: Mapping and architecture co-exploration for large-scale DNN chiplet accelerators,” in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 156–171, IEEE, 2024.
[37]
AMD, Versal AI Core Series VCK190 Evaluation Kit, 2021. Available: https://www.xilinx.com/products/boards-and-kits/vck190.html, Accessed: 16-August-2024.
[38]
M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, “core directory in Tangram,” 2019. Available: https://github.com/stanford-mast/nn_dataflow/tree/master/nn_dataflow/core, Accessed: 5-November-2024.
[39]
H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, and A. Parashar, “AHW_Accelerator.hpp in Maestro,” 2020. Available: https://github.com/maestro-project/maestro/blob/master/cost-model/include/abstract-hardware-model/AHW_Accelerator.hpp, Accessed: 5-November-2024.
[40]
J. Cai, Z. Wu, S. Peng, Y. Wei, Z. Tan, G. Shi, M. Gao, and K. Ma, “core.cpp in GEMINI-HPCA2024,” 2024. Available: https://github.com/SET-Scheduling-Project/GEMINI-HPCA2024/blob/master/src/core.cpp, Accessed: 5-November-2024.
[41]
J. Cai, Y. Wei, Z. Wu, S. Peng, and K. Ma, “core.cpp in SET-ISCA2023,” 2023. Available: https://github.com/SET-Scheduling-Project/SET-ISCA2023/blob/master/src/core.cpp, Accessed: 5-November-2024.
[42]
R. Ma, J.-C. Hsu, T. Tan, E. Nurvitadhi, D. Sheffield, R. Pelt, M. Langhammer, J. Sim, A. Dasu, and D. Chiou, “Specializing FGPU for persistent deep learning,” in 2019 29th International Conference on Field Programmable Logic and Applications (FPL), pp. 326–333, 2019.
[43]
J. E. Smith, “Decoupled access/execute computer architectures,” in Proceedings of the 9th Annual Symposium on Computer Architecture, ISCA ’82, (Washington, DC, USA), p. 112–119, IEEE Computer Society Press, 1982.
[44]
T. Nowatzki, V. Gangadhar, N. Ardalani, and K. Sankaralingam, “Stream-dataflow acceleration,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 416–429, 2017.
[45]
K. Sankaralingam, T. Nowatzki, V. Gangadhar, P. Shah, M. Davies, W. Galliher, Z. Guo, J. Khare, D. Vijay, P. Palamuttam, M. Punde, A. Tan, V. Thiruvengadam, R. Wang, and S. Xu, “The Mozart reuse exposed dataflow processor for AI and beyond: industrial product,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, ISCA ’22, (New York, NY, USA), p. 978–992, Association for Computing Machinery, 2022.
[46]
J. Weng, S. Liu, V. Dadu, Z. Wang, P. Shah, and T. Nowatzki, “DSAGEN: Synthesizing programmable spatial accelerators,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 268–281, 2020.
[47]
S. Liu, J. Weng, D. Kupsh, A. Sohrabizadeh, Z. Wang, L. Guo, J. Liu, M. Zhulin, R. Mani, L. Zhang, J. Cong, and T. Nowatzki, “OverGen: Improving FPGA usability through domain-specific overlay generation,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 35–56, 2022.
[48]
A. Parashar, M. Pellauer, M. Adler, B. Ahsan, N. Crago, D. Lustig, V. Pavlov, A. Zhai, M. Gambhir, A. Jaleel, R. Allmon, R. Rayess, S. Maresh, and J. Emer, “Triggered instructions: a control paradigm for spatially-programmed architectures,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY, USA), p. 142–153, Association for Computing Machinery, 2013.
[49]
S. Swanson, K. Michelson, A. Schwerin, and M. Oskin, “WaveScalar,” in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36., pp. 291–302, 2003.
[50]
V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim, “DySER: Unifying functionality and parallelism specialization for energy-efficient computing,” IEEE Micro, vol. 32, no. 5, pp. 38–51, 2012.
[51]
AMD, Vitis Unified Software Platform 2023.2, 2023. Software.
[52]
Hugging Face, Inc., “BERT-Large model implementation in PyTorch,” 2023. Software.
[53]
NVIDIA, “DeepLearningExamples: BERT language modeling with TensorFlow 2,” 2024. Available: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/LanguageModeling/BERT, Accessed: 16-August-2024.
[54]
AMD, “BEAM tools,” 2021. Available: https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/973078551/BEAM+Tool+for+VCK190+Evaluation+Kit, Accessed: 16-August-2024.
[55]
Intel, “INT8 vs. FP32 performance comparison.” Available: https://intelkevinputnam.github.io/openvino-docs/pages/openvino_docs_performance_int8_vs_fp32.html, Accessed: 5-November-2024.
[56]
E. Taka, A. Arora, K. C. Wu, and D. Marculescu, “MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine,” in 2023 International Conference on Field Programmable Technology (ICFPT), pp. 96–105, IEEE, 2023.
[57]
X. Deng, S. Wang, T. Gao, J. Liu, L. Liu, and N. Zheng, “AMA: An analytical approach to maximizing the efficiency of deep learning on Versal AI engine,” in 2024 34th International Conference on Field-Programmable Logic and Applications (FPL), pp. 227–235, 2024.
[58]
NVIDIA, “Nvidia T4 tensor core GPU,” 2018. Available: https://resources.nvidia.com/en-us-gpu-resources/t4-tensor-core-datas?lx=CPwSfP, Accessed: 21-November-2024.
[59]
NVIDIA, “Nvidia Tesla V100 GPU architecture,” 2017. Available: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf, Accessed: 21-November-2024.
[60]
NVIDIA, “Nvidia L4 tensor core GPU,” 2024. Available: https://resources.nvidia.com/en-us-data-center-overview/l4-gpu-datasheet, Accessed: 21-November-2024.
[61]
AMD, “Versal ACAP package pinout documentation: Mechanical - vc1802 and vc1902,” 2024. Available: https://docs.amd.com/r/en-US/am013-versal-pkg-pinout/VIVA1596-Mechanical-VC1802-and-VC1902, Accessed: 16-August-2024.