July 09, 2024
We present a GlobalFoundries 22FDX FD-SOI application-specific integrated circuit (ASIC) of a beamspace equalizer for millimeter-wave (mmWave) massive multiple-input multiple-output (MIMO) systems. The ASIC implements a recently proposed power-saving technique called sparsity-adaptive equalization (SPADE). SPADE exploits the inherent sparsity of mmWave channels in the beamspace domain to reduce the dynamic power of matrix-vector products by skipping multiplications for which the magnitudes of both operands are below pre-defined thresholds. Simulations with realistic mmWave channels show that SPADE incurs less than 0.7 dB SNR degradation at 1% target bit error rate compared to antenna-domain equalization. ASIC measurement results demonstrate an equalization throughput of 46 Gbps and show that SPADE offers up to 38% power savings compared to antenna-domain equalization. A comparison with state-of-the-art massive MIMO equalizer designs reveals that our ASIC achieves superior normalized energy efficiency.
Fifth generation (5G) and beyond-5G wireless communication systems take advantage of large contiguous portions of the available spectrum at millimeter-wave (mmWave) frequencies to enable wideband communication [1]. Corresponding basestations (BSs) rely on massive multiple-input multiple-output (MIMO) [2], which (i) mitigates the high path loss at mmWave frequencies [3] and (ii) enables multi-user (MU) communication by means of spatial multiplexing. Wideband communication requires high baseband sampling rates and massive MU-MIMO generates high-dimensional data—together, they significantly increase hardware complexity. In this paper, we present a hardware implementation of a technique that reduces the power consumption of data detection.
A promising approach to reducing the complexity of data detection in all-digital mmWave massive MU-MIMO systems is to exploit the inherent sparsity of mmWave channels [3], [4] in the so-called beamspace. Converting a system from the antenna domain into beamspace is achieved by applying a spatial discrete Fourier transform (DFT) to the signals received at a uniform linear antenna array [5]–[10]. Uplink data detection in beamspace, with the goal of reducing implementation complexity, has been studied recently for mmWave massive MU-MIMO systems, mainly in the context of linear data detectors, as nonlinear methods typically incur higher complexity. Linear data detection consists of two phases: (i) preprocessing, where an equalization matrix is computed based on a channel-matrix estimate, and (ii) equalization, where the equalization matrix is multiplied with the received vectors to obtain estimates of the transmitted data symbols. While preprocessing is performed only once per coherence interval, equalization must be performed for each received vector and, hence, at much higher rates than preprocessing. In this paper, we focus on reducing the complexity of equalization and assume that preprocessing is performed externally.
Existing beamspace data detectors reduce equalization complexity by designing sparse equalization matrices with specific sparsity patterns, thereby reducing the number of multiplications required for equalization. Such sparsity-exploiting beamspace data detectors, however, either incur a notable performance degradation compared to conventional antenna-domain linear minimum mean squared error (LMMSE) equalization, e.g., [5], [6], or require preprocessing algorithms with extremely high computational complexity [7].
In [8], a different approach to reducing complexity by exploiting beamspace sparsity was proposed. The method is referred to as sparsity-adaptive equalization (SPADE). It leverages the fact that the LMMSE equalization matrix is already approximately sparse in beamspace and therefore avoids computing a sparse equalization matrix with a specific sparsity pattern. To reduce equalization complexity, SPADE uses two pre-computed thresholds to skip multiplications whenever the absolute values of both operands are below these thresholds. As shown in [8], SPADE significantly reduces the number of required multiplications, while exhibiting comparable performance to state-of-the-art linear beamspace data detectors [5]–[7].
We present the first application-specific integrated circuit (ASIC) capable of performing SPADE-based beamspace equalization as well as antenna-domain equalization for a massive MU-MIMO system with \(64\) BS antennas and up to \(16\) single-antenna user equipments (UEs). In addition, we demonstrate real-world power savings achieved by SPADE-based beamspace equalization over conventional antenna-domain equalization through extensive ASIC measurements.
Boldface lowercase and uppercase letters represent column vectors and matrices, respectively. For a matrix \(\mathbf{A}\), the transpose is \(\mathbf{A}^\textrm{T}\) and the Hermitian transpose is \(\mathbf{A}^\textrm{H}\). The \(m\)th column of \(\mathbf{A}\) is \(\mathbf{a}_m = [\mathbf{A}]_m\), and the entry on the \(m\)th row and \(n\)th column is \(A_{m,n} = [\mathbf{A}]_{m,n}\). For a vector \(\mathbf{a}\), the \(k\)th entry is \(a_k = [\mathbf{a}]_k\), and the real and imaginary parts are \(\mathbf{a}^R\) and \(\mathbf{a}^I\), respectively. The \(\ell_\infty\)- and \(\ell_{\widetilde{\infty}}\)-norms are \(\|\mathbf{a}\|_\infty \triangleq\max_{k} |a_k|\) and \(\|\mathbf{a}\|_{\widetilde{\infty}} \triangleq\max\{\|\mathbf{a}^R\|_\infty,\|\mathbf{a}^{I}\|_\infty\}\), respectively [11]. Bars over variables indicate antenna-domain quantities. Expectation with respect to a random vector \(\mathbf{a}\) is denoted by \(\mathbb{E}_{\mathbf{a}}[\cdot]\).
We consider the uplink of an all-digital mmWave massive MU-MIMO system. Here, \(U\) single-antenna UEs transmit data to a basestation that is equipped with a \(B\)-antenna uniform linear array. The received vector at the BS is given by\(^{3}\) \(\bar{\mathbf{y}} = \bar{\mathbf{H}} \mathbf{s}+ \bar{\mathbf{n}}\), where \(\bar{\mathbf{y}}\in\mathbb{C}^B\) is the antenna-domain receive vector, \(\bar{\mathbf{H}} \in \mathbb{C}^{B\times U}\) is the antenna-domain MIMO channel matrix, \(\mathbf{s}\in\mathbb{C}^U\) contains the transmit symbols of the \(U\) UEs (taken from a constellation set), and \(\bar{\mathbf{n}}\in\mathbb{C}^B\) models noise with i.i.d. circularly symmetric complex Gaussian entries with variance \(N_0\). The transmit symbols are assumed to satisfy the power constraint \(\mathbb{E}_{\mathbf{s}}[\mathbf{s}\mathbf{s}^\textrm{H}]=E_s\mathbf{I}_U\).
Using the well-known planar-wave approximation\(^{4}\) [12], we model the wireless channel between a UE and the BS as \[\begin{align} \label{eq:planarwave} \bar{\mathbf{h}} = \textstyle \sum_{\ell=0}^{L-1} \alpha_\ell \> \bar{\mathbf{a}}(\phi_\ell), \end{align}\tag{1}\] where \(L\) refers to the number of propagation paths, \(\alpha_\ell\in\mathbb{C}\) is the complex-valued channel gain of the \(\ell\)th path, and \[\begin{align} \label{eq:complexsinusoids} \bar{\mathbf{a}}(\phi_\ell) = \big[1, e^{j\phi_\ell},e^{j2\phi_\ell},\dots, e^{j(B-1)\phi_\ell} \big]^\textrm{T}, \end{align}\tag{2}\] where the spatial frequency \(\phi_\ell\) is determined by the \(\ell\)th path’s angle-of-arrival at the BS antenna array. Due to the predominantly directional nature of wave propagation at mmWave frequencies [13], \(L\) is typically much smaller than \(B\) in massive MIMO systems, meaning that the channel vector of each UE is a superposition of only a few complex sinusoids. Hence, the DFT of \(\bar{\mathbf{h}}\) in (1) results in an approximately sparse vector [10], meaning that most of its entries are close to zero, while only a few entries have large magnitude. By applying such a spatial DFT to the antenna-domain received vector \(\bar{\mathbf{y}}\), we arrive at the following beamspace system model: \[\begin{align} \label{eq:BD95model} \mathbf{y}= \mathbf{F}\bar{\mathbf{y}} = \mathbf{F}\bar{\mathbf{H}}\mathbf{s}+ \mathbf{F}\bar{\mathbf{n}} = \mathbf{H}\mathbf{s}+ \mathbf{n}. \end{align}\tag{3}\] Here, \(\mathbf{y}\in\mathbb{C}^B\) is the beamspace receive vector, \(\mathbf{F}\in\mathbb{C}^{B\times B}\) is the unitary DFT matrix, \(\mathbf{H}= \mathbf{F}\bar{\mathbf{H}}\) is the beamspace MIMO channel matrix, and \(\mathbf{n}= \mathbf{F}\bar{\mathbf{n}}\) is the beamspace-equivalent noise vector, which has the same statistics as \(\bar{\mathbf{n}}\) because \(\mathbf{F}\) is unitary.
Therefore, the beamspace system model in (3) is statistically equivalent to the antenna-domain system model, and data detection using both models gives exactly the same result.
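The approximate sparsity claimed above can be checked numerically with a minimal sketch. It builds a channel vector from (1) and (2), applies a unitary DFT as in (3), and measures how much energy the largest beamspace entries capture. The path gains, the spatial frequencies, the DFT sign convention, and all function names are illustrative assumptions and are not taken from the paper.

```python
import cmath
import math

def steering_vector(phi, B):
    """Complex sinusoid a(phi) from (2): [1, e^{j phi}, ..., e^{j(B-1) phi}]."""
    return [cmath.exp(1j * phi * n) for n in range(B)]

def planar_wave_channel(gains, phis, B):
    """Channel vector from (1): sum of gain-weighted steering vectors."""
    h = [0j] * B
    for alpha, phi in zip(gains, phis):
        a = steering_vector(phi, B)
        h = [hn + alpha * an for hn, an in zip(h, a)]
    return h

def unitary_dft(x):
    """Multiply x by a unitary DFT matrix (direct O(B^2) evaluation).

    The sign of the exponent is an assumed convention.
    """
    B = len(x)
    scale = 1.0 / math.sqrt(B)
    return [scale * sum(x[n] * cmath.exp(-2j * math.pi * k * n / B)
                        for n in range(B))
            for k in range(B)]

def top_k_energy_fraction(x, k):
    """Fraction of the total energy held by the k largest-magnitude entries."""
    energies = sorted((abs(v) ** 2 for v in x), reverse=True)
    return sum(energies[:k]) / sum(energies)

B = 64
# Hypothetical two-path channel with off-grid spatial frequencies.
h_bar = planar_wave_channel([1.0, 0.5], [0.7, -1.3], B)
h = unitary_dft(h_bar)
frac = top_k_energy_fraction(h, 16)  # energy share of the 16 largest of 64 entries
```

Because the transform is unitary, the total energy of `h` equals that of `h_bar`; only its distribution over the entries changes.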
We now summarize SPADE [8], which is the main ingredient of our ASIC. First, we note that due to the approximate sparsity of the beamspace channel matrix \(\mathbf{H}\), the beamspace LMMSE equalization matrix \(\mathbf{V}= (\mathbf{H}^\textrm{H}\mathbf{H}+ \frac{N_0}{E_s} \mathbf{I})^{-1}\mathbf{H}^\textrm{H}\) also exhibits approximate sparsity. Second, since the beamspace receive vector \(\mathbf{y}\) is a linear combination of a few sparse vectors, it is also approximately sparse. SPADE exploits this approximate sparsity in both \(\mathbf{V}\) and \(\mathbf{y}\) in order to reduce the number of effective multiplications during equalization \(\hat{\mathbf{s}} = \mathbf{V}\mathbf{y}\).
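For reference, the LMMSE equalization matrix \(\mathbf{V}\) defined above can be computed with a short sketch. The paper assumes that this preprocessing step is performed externally, so the following is only an illustration: it uses plain Gauss-Jordan elimination, which is an assumed choice, and all helper names are hypothetical.

```python
def hermitian(A):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[A[r][c].conjugate() for r in range(len(A))]
            for c in range(len(A[0]))]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]

def lmmse_matrix(H, n0_over_es):
    """V = (H^H H + (N0/Es) I)^{-1} H^H for a B x U channel matrix H."""
    Hh = hermitian(H)
    G = matmul(Hh, H)
    for u in range(len(G)):
        G[u][u] += n0_over_es
    return solve(G, Hh)

# Toy 3 x 2 channel with orthogonal columns (illustrative values only).
V = lmmse_matrix([[1, 0], [0, 2], [0, 0]], 1.0)
```

The same function applies to both \(\bar{\mathbf{H}}\) and \(\mathbf{H}\); only the beamspace version is expected to produce an approximately sparse result.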
Consider the \(u\)th inner product \(\hat{s}_u = \sum_{b=1}^B V_{u,b} y_b\). Due to the approximate sparsity of \(\mathbf{y}\) and the rows of \(\mathbf{V}\), many of the operands \(V_{u,b}\) and \(y_b\) have small magnitude. Since the number of BS antennas \(B\) is large in massive MIMO, each such inner product is a sum of a large number of products. Consequently, one can skip multiplications where both \(|V_{u,b}|\) and \(|y_b|\) are small without notably perturbing the inner-product result. Since each complex-valued multiplication consists of four real-valued multiplications, one can, instead of skipping an entire complex-valued multiplication, individually turn the real-valued multiplications on or off based on the absolute values of their operands. SPADE uses two thresholds \(\tau_y\) and \(\tau_w\) to skip real-valued multiplications for which the absolute values of both operands are below the respective thresholds. These thresholds trade computational complexity for accuracy of the inner product and are determined offline based on simulations with the goal of minimizing (i) the approximation error and (ii) the multiplier activity rate, which is the average number of executed real-valued multiplications divided by the total \(4BU\) real-valued multiplications involved in \(\mathbf{V}\mathbf{y}\).
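The skipping rule can be made concrete with a small functional sketch of a thresholded inner product and of the multiplier activity rate. The strict comparison (`<` rather than `<=`) and the function names are assumptions; the source only states that a real-valued multiplication is skipped when both operands are below their thresholds.

```python
def spade_real_mul(w_part, y_part, tau_w, tau_y):
    """Return (product, executed); skip only if BOTH operands are small."""
    if abs(w_part) < tau_w and abs(y_part) < tau_y:
        return 0.0, 0
    return w_part * y_part, 1

def spade_dot(w_row, y, tau_w, tau_y):
    """Approximate inner product; also counts executed real multiplications."""
    re = im = 0.0
    executed = 0
    for w, yb in zip(w_row, y):
        rr, e1 = spade_real_mul(w.real, yb.real, tau_w, tau_y)
        ii, e2 = spade_real_mul(w.imag, yb.imag, tau_w, tau_y)
        ri, e3 = spade_real_mul(w.real, yb.imag, tau_w, tau_y)
        ir, e4 = spade_real_mul(w.imag, yb.real, tau_w, tau_y)
        re += rr - ii
        im += ri + ir
        executed += e1 + e2 + e3 + e4
    return complex(re, im), executed

def spade_mvm(W, y, tau_w, tau_y):
    """Matrix-vector product with SPADE; returns (estimates, activity rate)."""
    results = [spade_dot(row, y, tau_w, tau_y) for row in W]
    s_hat = [value for value, _ in results]
    activity = sum(count for _, count in results) / (4 * len(W) * len(y))
    return s_hat, activity

# Toy operands (illustrative values only).
w = [0.9 + 0.01j, 0.02 + 0.03j]
y = [1.0 + 0.05j, 0.04 - 0.02j]
approx, executed = spade_dot(w, y, tau_w=0.1, tau_y=0.1)
```

In this toy case, only three of the eight real-valued multiplications are executed, and the result deviates from the exact inner product by roughly \(10^{-3}\) in magnitude. Setting both thresholds to zero recovers the exact product.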
Note that the rows of \(\mathbf{V}\) may have vastly different dynamic ranges, calling for a separate threshold \(\tau_w\) for each row. In order to use the same threshold \(\tau_w\) for all of the inner products, the rows of \(\mathbf{V}\) are scaled to obtain \(\mathbf{W}= \mathrm{diag}(\boldsymbol{\alpha})\mathbf{V}\), where \(\alpha_u=1/(\|[\mathbf{V}^\textrm{T}]_u\|_{\widetilde{\infty}}+\varepsilon)\), and \(\varepsilon>0\) is a small constant that ensures that \(\|[\mathbf{W}^\textrm{T}]_u\|_{\widetilde{\infty}}\) is just below one. The final symbol estimates are then obtained as \(\hat{\mathbf{s}} = \mathrm{diag}(\boldsymbol{\alpha})^{-1} \mathbf{W}\mathbf{y}\). This row-wise scaling has the additional benefit of reducing the overall dynamic range of the entries in \(\mathbf{V}\), thereby reducing the minimum required bitwidth of entries in \(\mathbf{W}\) in its fixed-point representation.
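The row-wise scaling can be sketched as follows. The value of \(\varepsilon\) and the helper names are assumptions chosen for illustration; the paper does not specify them here.

```python
def linf_tilde(row):
    """l_inf-tilde norm: largest absolute real or imaginary part in the row."""
    return max(max(abs(v.real), abs(v.imag)) for v in row)

def scale_rows(V, eps=2.0 ** -10):
    """Return W = diag(alpha) V and alpha, with alpha_u = 1 / (norm_u + eps)."""
    alpha = [1.0 / (linf_tilde(row) + eps) for row in V]
    W = [[a * v for v in row] for a, row in zip(alpha, V)]
    return W, alpha

def unscale(alpha, Wy):
    """Undo the row scaling: s_hat = diag(alpha)^{-1} (W y)."""
    return [value / a for a, value in zip(alpha, Wy)]

# Toy matrix whose rows have very different dynamic ranges.
V = [[2 + 4j, 0.5j], [0.1 + 0j, 0.2 - 0.05j]]
y = [1 + 1j, -2j]
W, alpha = scale_rows(V)
Wy = [sum(w * yb for w, yb in zip(row, y)) for row in W]
s_hat = unscale(alpha, Wy)
```

After scaling, the real and imaginary parts of every entry of `W` lie in the open interval \((-1, 1)\), so a single threshold \(\tau_w\) and a single fixed-point format can serve all rows.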
In order to evaluate the impact of SPADE’s approximate inner-product computations on the system performance, we simulate the uncoded bit error rate (BER) for beamspace LMMSE employing SPADE (referred to as “LMMSE-SPADE”). Fig. 2 shows the results for LMMSE-SPADE with \(16\) or \(8\) UEs transmitting \(16\)-QAM symbols to a \(64\)-antenna BS over line-of-sight (LoS)\(^{5}\) and non-LoS channels generated with the QuaDRiGa mmMAGIC simulator [14]. We also show the BER of the antenna-domain mode of our ASIC, labeled “LMMSE-A (ASIC),” as well as a floating-point reference. We observe that LMMSE-SPADE incurs less than \(0.4\) dB and \(0.7\) dB SNR loss at \(1\)% BER for \(U=8\) and \(U=16\), respectively, compared to LMMSE-A. The threshold pairs corresponding to these simulations and the ASIC measurements were determined offline, as described next. Furthermore, we compare the BER to three state-of-the-art massive MU-MIMO data detection algorithms: FLMMSE [15], recursive conjugate gradient (RCG) [16], and zero-forcing with beam selection (ZF-BS) [6]. While FLMMSE exhibits similar performance to LMMSE-SPADE, RCG suffers a performance loss in highly correlated mmWave channels, and ZF-BS suffers a notable performance loss in non-LoS channels.
Figure 2: Uncoded BER simulation results of LMMSE-A with floating-point operations, fixed-point LMMSE-A and LMMSE-SPADE as implemented in our ASIC, as well as existing baseline algorithms.
The SPADE thresholds \(\tau_y\) and \(\tau_w\) determine the inner-product accuracy and the multiplier activity rate. To simplify implementation and to eliminate the need for determining these thresholds on-the-fly for each considered system dimension and channel condition (LoS or non-LoS), we use a single fixed pair of thresholds that achieves a desirable trade-off between BER and power savings. To find this pair, we performed BER simulations offline for a range of threshold pairs, and for each pair, we computed the average multiplier activity rate and the SNR operating point (i.e., the minimum SNR that achieves \(1\%\) BER). Fig. 3 provides the results for a system with \(B=64\) BS antennas and \(U=16\) UEs. It also shows the power consumption corresponding to each threshold pair, extracted from stimuli-based power simulations at \(500\) MHz (on the right y-axis), as well as the chosen threshold pair. Evidently, the multiplier activity rate predicts the power consumption well.
Figure 3: Multiplier activity rate vs. SNR operating point at \(1\%\) BER in a \(64 \times 16\) system for several threshold pairs for (a) non-LoS and (b) LoS channels. The thresholds for our ASIC measurements were determined offline.
The high-level architecture of our ASIC consists of four main components: (i) an input SRAM, which stores input test vectors, (ii) a fast Fourier transform (FFT) unit that transforms the antenna-domain received vectors into beamspace, (iii) a SPADE-enabled matrix-vector multiplier (MVM), and (iv) an output SRAM that stores the results. The beamspace FFT is implemented using a fully unrolled radix-4 architecture with low-resolution twiddle factors according to [17]; this enables transforming an entire \(B\)-element antenna-domain vector into beamspace per clock cycle at a small area and power overhead. To compare the efficacy of beamspace vs. antenna-domain processing, our architecture can be configured to operate either in antenna-domain mode, in which the beamspace FFT is turned off, or in beamspace mode, in which the beamspace FFT is active. In addition, when operating in beamspace mode, a save-power (SP) signal controls whether SPADE-based power saving is activated; this enables us to measure the impact of SPADE.
The top-level architecture of the SPADE-MVM consists of \(U=16\) dot-product (DOTP) modules, each consisting of \(B=64\) SPADE complex-valued multipliers (CMs) and a \(B\)-input adder tree with \(\log_2(B)\) adder layers and pipeline registers after every two layers. Each of the dot-product modules computes one entry of \(\mathbf{W}\mathbf{y}\). SPADE-MVM receives a \(B\)-element complex-valued vector per clock cycle through its input ports. During a \(U\)-cycle weight-loading phase, indicated by raising a load-weight (LW) signal, a new equalization matrix \(\mathbf{W}\) is loaded into the registers of the SPADE-CMs. When LW is low, the incoming vector \(\mathbf{y}\) is multiplied with all rows of the stored \(\mathbf{W}\) simultaneously, performing one equalization operation \(\mathbf{W}\mathbf{y}\) per clock cycle. Furthermore, the absolute values of the real and imaginary parts of the input signals are compared to the w-threshold \(\tau_w\) if LW is asserted and to the y-threshold \(\tau_y\) otherwise. The resulting comparison bits are broadcast to the SPADE-CMs, where they are used for adaptive power saving.
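The adder-tree depth and the shared threshold comparators can be illustrated with a functional sketch. It is not a cycle-accurate model: the number of pipeline stages is inferred from "registers after every two layers", and treating a comparison bit as set when the magnitude is strictly below the threshold is an assumption.

```python
def adder_tree(values):
    """Sum `values` layer by layer, as a balanced binary adder tree would."""
    layer = list(values)
    num_layers = 0
    while len(layer) > 1:
        if len(layer) % 2:          # pad odd-sized layers with a zero input
            layer.append(0)
        layer = [layer[i] + layer[i + 1] for i in range(0, len(layer), 2)]
        num_layers += 1
    return layer[0], num_layers

def comparison_bits(x, load_weights, tau_w, tau_y):
    """Compare |Re x| and |Im x| to tau_w while LW is high, else to tau_y."""
    tau = tau_w if load_weights else tau_y
    return abs(x.real) < tau, abs(x.imag) < tau

total, layers = adder_tree(range(64))   # B = 64 inputs
pipeline_stages = layers // 2           # assumed: one register stage per two layers
bits_lw = comparison_bits(0.05 + 0.5j, True, tau_w=0.1, tau_y=0.01)
bits_y = comparison_bits(0.05 + 0.5j, False, tau_w=0.1, tau_y=0.01)
```

For \(B=64\), the tree has six adder layers, which under the stated assumption corresponds to three pipeline register stages.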
The SPADE-CM architecture contains four real-valued multipliers and two adders. The pipeline registers not only shorten the critical path and reduce glitching activity, but also provide a mechanism to conditionally mute each of the four real-valued multipliers by freezing their inputs. Two of the registers store the entries of the equalization matrix along with the w-threshold comparison bits (\(c_w^R\) and \(c_w^I\)). The other four input registers hold the real and imaginary parts of the input signal. If the save-power (SP) signal is asserted and both the w- and y-threshold comparison bits corresponding to a particular register are set, then that register is disabled to mute switching activity, which reduces dynamic power consumption. At the input of the adders, a multiplexer selects zero if the preceding multiplier was muted. Depending on the channel conditions (e.g., depending on how sparse \(\mathbf{H}\) is) and the instantaneous receive vector, different subsets of multipliers within the SPADE-CMs are muted; this results in adaptive savings in dynamic power.
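The muting mechanism can be mimicked by a small behavioral model. This is a sketch under assumptions, not the fabricated circuit: it ignores pipelining and fixed-point effects, treats a comparison bit as set when the magnitude is strictly below the threshold, and counts register updates as a crude proxy for switching activity. The class and attribute names are hypothetical.

```python
class SpadeCM:
    """Behavioral (not cycle-accurate) model of one SPADE complex multiplier."""

    # (weight part, input part, adder it feeds, sign) for the four real multipliers
    LANES = (("R", "R", "re", +1), ("I", "I", "re", -1),
             ("R", "I", "im", +1), ("I", "R", "im", +1))

    def __init__(self):
        self.w = {"R": 0.0, "I": 0.0}
        self.c_w = {"R": False, "I": False}
        self.y_regs = [0.0] * 4      # one input register per real multiplier
        self.reg_updates = 0         # proxy for switching activity

    def load_weight(self, w, tau_w):
        """Store one equalization-matrix entry and its comparison bits."""
        self.w = {"R": w.real, "I": w.imag}
        self.c_w = {part: abs(v) < tau_w for part, v in self.w.items()}

    def step(self, y, tau_y, save_power=True):
        """Process one input sample and return the (approximate) product."""
        y_parts = {"R": y.real, "I": y.imag}
        c_y = {part: abs(v) < tau_y for part, v in y_parts.items()}
        out = {"re": 0.0, "im": 0.0}
        for lane, (wp, yp, dst, sign) in enumerate(self.LANES):
            muted = save_power and self.c_w[wp] and c_y[yp]
            if muted:
                continue             # register frozen; mux feeds zero to the adder
            self.y_regs[lane] = y_parts[yp]
            self.reg_updates += 1
            out[dst] += sign * self.w[wp] * self.y_regs[lane]
        return complex(out["re"], out["im"])

cm = SpadeCM()
cm.load_weight(0.9 + 0.01j, tau_w=0.1)
product_sp = cm.step(1.0 + 0.05j, tau_y=0.1)                       # SP asserted
product_full = cm.step(1.0 + 0.05j, tau_y=0.1, save_power=False)   # SP deasserted
```

With SP asserted, only the multiplier whose two operands are both small is muted in this example; with SP deasserted, all four multipliers run and the exact product is returned.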
Figure 6: (left) Micrograph of the fabricated ASIC with the key blocks highlighted; (right) area breakdown of all active circuitry.
Figure 7: ASIC measurement results: (a) voltage-frequency scaling (VFS); (b) clock frequency and equalizer power vs. body biasing voltage; power is measured in LMMSE-SPADE mode with LoS channels for \(U=16\).
A micrograph of the fabricated ASIC along with an area breakdown is shown in Fig. 6. Fig. 7(a) shows the impact of the core voltage (VDD) on the maximum clock frequency, and Fig. 7(b) shows the effect of forward body biasing (FBB) on the maximum clock frequency and the power consumption at a clock frequency of \(500\) MHz. For simplicity, in Fig. 7(b), only the PMOS body biasing voltage is varied and the NMOS body bias is fixed to \(0.4\) V. We observe that \(1.0\) V of PMOS forward body biasing results in a \(29\)% increase in the maximum clock frequency and a \(46\)% increase in the total power consumption, which is due to the increase in leakage power. We report the power breakdown of our ASIC’s main modules in three modes: (i) LMMSE-A, in which the beamspace FFT is turned off and antenna-domain equalization is performed, (ii) LMMSE-B, in which beamspace equalization is performed but without SPADE-based power saving (no multiplications are muted), and (iii) LMMSE-SPADE, which performs beamspace equalization with SPADE-based power saving. LMMSE-SPADE provides \(21\)% and \(38\)% power savings with respect to LMMSE-A under non-LoS and LoS conditions, respectively.
We compare the key characteristics of our fabricated ASIC with those of state-of-the-art massive MU-MIMO data detectors. We scale the reported metrics as detailed in footnotes \(^{ \textrm{f}}\) and \(^{ \textrm{g}}\). Our ASIC achieves similar or superior normalized energy- and area-efficiency compared to state-of-the-art hardware implementations. Even though the designs in [16], [18] achieve competitive energy and area efficiency, their BER is inferior in mmWave channels (cf. Fig. 2 and [18]). The design from [15] achieves better area-efficiency by utilizing extremely low-resolution equalization weights and input vectors. This design, however, would require a more complicated preprocessing engine than our design. Finally, we emphasize that our ASIC achieves up to \(58.8\) Gbps at only \(833\) mW with body biasing, which is, to our knowledge, the highest 16-QAM throughput reported in the literature.
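The relation between clock frequency, throughput, and energy per bit can be sketched with a few lines of arithmetic. The formula assumes that one \(U\)-symbol vector is equalized per clock cycle, as described for the SPADE-MVM, with \(U=16\) UEs and 16-QAM. The resulting clock frequency is therefore an implied value and not a measured one, and the energy figure is not the normalized metric used in the comparison.

```python
import math

def equalizer_throughput_bps(num_ues, constellation_size, f_clk_hz):
    """Bits per second if one U-symbol vector is equalized every clock cycle."""
    return num_ues * math.log2(constellation_size) * f_clk_hz

def implied_clock_hz(throughput_bps, num_ues, constellation_size):
    """Clock frequency implied by a throughput under the same assumption."""
    return throughput_bps / (num_ues * math.log2(constellation_size))

def energy_per_bit_pj(power_w, throughput_bps):
    """Unnormalized energy per bit in picojoules."""
    return power_w / throughput_bps * 1e12

f_implied = implied_clock_hz(58.8e9, num_ues=16, constellation_size=16)
e_bit = energy_per_bit_pj(0.833, 58.8e9)
```

Under these assumptions, \(58.8\) Gbps corresponds to 64 bits per cycle at roughly \(919\) MHz, and \(833\) mW at that throughput corresponds to about \(14.2\) pJ per bit before any technology or voltage normalization.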
We have presented the first architecture and ASIC of a mmWave massive MU-MIMO equalizer capable of both antenna-domain and beamspace equalization with adaptive power-saving capabilities. In beamspace equalization mode, our ASIC is able to use SPADE [8] to adaptively reduce the dynamic power consumption. Measurement results have revealed that, despite the overhead of the necessary beamspace FFT, beamspace equalization with SPADE enables up to \(38\)% power savings compared to antenna-domain equalization. Furthermore, our ASIC achieves a record throughput of \(58.8\) Gbps with body biasing at a better or similar normalized energy- and area-efficiency compared to state-of-the-art designs.
\(^{1}\) S. H. Mirfarshbafan and C. Studer are with the Department of Information Technology and Electrical Engineering, ETH Zürich, Switzerland (email: mirfarshbafan@iis.ee.ethz.ch and studer@ethz.ch).
\(^{2}\) The authors thank GlobalFoundries for providing silicon fabrication through the 22FDX University Program.
\(^{3}\) Although this model only holds for frequency-flat channels, an extension of this paper’s results to frequency-selective channels is straightforward if using orthogonal frequency-division multiplexing (OFDM).
\(^{4}\) We use the planar-wave approximation for mathematical analysis; our simulations, however, use channel vectors generated with spherical waves.
\(^{5}\) The generated LoS channel vectors also include reflective paths.