November 27, 2024
As the field of quantum computing grows, novel algorithms which take advantage of quantum phenomena need to be developed. As we are currently in the NISQ (noisy intermediate scale quantum) era, quantum algorithm researchers cannot reliably test their algorithms on real quantum hardware, which is still too limited. Instead, quantum computing simulators on classical computing systems are used. In the quantum circuit model, quantum bits (qubits) are operated on by quantum gates. A quantum circuit is a sequence of such quantum gates operating on some number of qubits. A quantum gate applied to a qubit can be controlled by other qubits in the circuit. This applies the gate only to the states which satisfy the required control qubit state.
We particularly target FPGAs as our main simulation platform, as these offer potential energy savings when compared to running simulations on CPUs/GPUs.
We focus on full-state-vector simulation of the quantum circuit model, which involves storing all possible states in memory and applying each gate in sequence. Because every qubit can be in 2 states, the memory requirement scales exponentially with the number of qubits. In general, each quantum gate operation involves accessing the entire state vector in pairs and so the number of steps to execute each gate operation grows exponentially with the number of qubits. Each added control to a gate halves the number of pairs that need to be accessed.
In this work, we present a memory access pattern to optimise the number of iterations that need to be scheduled to execute a quantum gate such that only the iterations which access the required pairs (determined according to the control qubits imposed on the gate) are scheduled. We show that this approach results in a significant reduction in the time required to simulate a gate for each added control qubit. We also show that this approach benefits the simulation time on FPGAs more than CPUs and GPUs and allows to outperform both CPU and GPU platforms in terms of energy efficiency, which is the main factor for scalability of the simulations.
quantum computing, quantum circuit simulation, hardware acceleration, scheduling
Quantum computing takes advantage of uniquely quantum phenomena like superposition and entanglement to allow for a new model of computation that can be up to exponentially more powerful than the classical model of computation, for specific problems. The problems to which quantum computing can be applied with a significant speedup remain limited and so research into further quantum algorithms is necessary.
Quantum computing simulation currently plays a key role in the development and verification of novel algorithms that can take advantage of the computational power offered by quantum computing.
In quantum computing, quantum bits (or qubits) take the place of classical bits as the units of information that are manipulated to perform computation. A qubit is described by two complex probability amplitudes:
\[\ket{q} = a\ket{0} + b\ket{1} \hskip 0.1cm ; \hskip 0.1cm |a|^2+|b|^2 = 1 \label{eq95qubit95def}\tag{1}\]
Every qubit added to the system doubles the number of states that can exist in superposition. And so, in general, we need \(2^n\) complex number to represent an \(n\)-qubit quantum system in a coherent state.
The most widely used model of quantum computation is the quantum circuit model, where algorithms are represented by sequences of quantum gate operations acting on one or more qubits, forming quantum circuits. Qubits are represented by horizontal lines in quantum circuit diagrams, stacked vertically for multi-qubit registers. Quantum gates are placed on these lines in square boxes. The \(X\) gate (\(X \equiv \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}\)) has the effect of flipping a qubit in the computational basis, such that \(X\ket{0}=\ket{1}\), \(X\ket{1}=\ket{0}\), by swapping of the qubit’s probability amplitudes. The \(X\) gate is often indicated by an xor symbol (\(\oplus\)) on a quantum circuit. This is the analogue of the classical NOT gate. The Hadamard gate (\(H \equiv \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}\)) introduces a superposition on a qubit that is in the computational basis: \(H\ket{0}=\frac{1}{\sqrt{2}}(\ket{0}+\ket{1})\) and \(H\ket{1}=\frac{1}{\sqrt{2}}(\ket{0}-\ket{1})\). Controls can be imposed on gate application to entangle two or more qubits. A control qubit is indicated by a black dot on the qubit’s line vertically connected to the gate. The left part of Figure 1 demonstrates a controlled \(X\) gate.
In more complex circuits, a series of such quantum gate operations is performed step-by-step. For illustration, the right-hand side of Figure 1 shows how a general single-qubit gate \(G\) with a negative control (here \(\ket{q1}=\ket{0}\)) can be implemented using a sequence of three gate operations. The number of time slices required to run the circuit (i.e. circuit slices with gates that affect mutually exclusive qubits) defines the circuit depth.
Extensive research has resulted in a range of highly-optimized quantum computer simulators for multi-core and distributed computing architectures. These include Intel Quantum Simulator [1], ProjectQ [2], JUMPIQCS [3], Qrack [4], and QCGPU [5].
The simulation of quantum computers and, specifically, quantum-circuit implementations of algorithms on FPGAs has been the topic of more recent research works. Examples of works focusing on FPGAs include[6], [7], [8], [9], [10], [11], [12], [13], and [14].
A fuller summary on recent works on cluster-based and FPGA-based simulation of quantum circuits is presented in the authors’ earlier work [15].
In this work we target full state vector simulation as it is the most general simulation approach, allowing for complete control and introspection of the state of the system at any point during the computation.
In our simulator, we support single qubit gates with arbitrary number of controls. Our supported gate set is currently comprised of the Hadamard gate (\(H\)), the Pauli gates (\(X\), \(Z\), and \(Y\)) and phase rotation gates (\(R_m\)). We allow an arbitrary number of controls on any gate, up to a compile-time architecture parameter for the maximum possible number of controls per gate. This set of gates supports universal quantum computation.
For an n-qubit simulation, \(2^n\) complex numbers need to be stored in memory. In general, to simulate the operation of one quantum gate, the entire memory space needs to be accessed. The memory space is accessed in pairs, and the access pattern depends on the gate’s target qubit index, \(t\), where \(t=0\) represents the least significant qubit in the register. In particular, the strides between the elements of each pair is given by \(2^t\). Consider the application of a general gate, \(G = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\), where \(a, b, c, d \in \mathbb{C}\), on a qubit \(t\). If the initial quantum state is defined by \(\ket{\Psi}_0\) and the state after applying the gate by \(\ket{\Psi}_1\) then: \[\ket{\Psi}_1 = \sum_{k=0}^{2^n} C_k^{(1)} \ket{k} \label{eq95psi195state}\tag{2}\] where complex amplitudes \(C_k^{(1)}\) are related to previous amplitudes \(C_k^{(0)}\) as: \[\begin{align} C_k^{(1)} &=& a C_k^{(0)} + b C_{k+2^t}^{(0)} \label{eq95psi195G95amplitudes}\\ C_{k+2^t}^{(1)} &=& c C_k^{(0)} + d C_{k+2^t}^{(0)}\nonumber \end{align}\tag{3}\] for \(k\in[0, 2^{n-1}-1]\).
Each quantum gate requires \(2^{n-1}\) iterations to be simulated, where each iteration accesses a unique pair of complex numbers in the memory. Because of this dependency, concurrent simulation of gate operations is not possible. We can however parallelise the iterations of any one gate.
Figure 2 shows the access pattern for the execution of a non-controlled gate on an example 3-qubit register. The top half of the figure shows the example for \(t=0\), where the pairs are constituted by contiguous elements from the state vector. The compute units update the pairs based on equation 3 , and stores them back into the state vector memory. The bottom part of the figure shows the access pattern for higher values of \(t\).
When controls are imposed on a gate, each control halves the set of pairs that need to be updated in memory. The condition to apply a controlled gate to an iteration pair is that the value of the bit (of the amplitude’s index in the state vector in binary) in the same position as the control qubit in the register is 1. It can be observed that since the two elements constituting a pair will only differ in the target qubit, then the controls will cancel whole iterations at a time (as opposed to possibly updating a single element of the pair); i.e. pairs are either entirely updated together or not at all.
Traditionally, this is handled by a check in the execution of an iteration, which determines whether the memory access should go ahead after the pair indices have been computed based on the target qubit’s index. On a CPU/GPU with dynamically scheduled iterations, this does not cause a loss in performance as no cycles are scheduled for reading and writing to memory until the control flow determines that this should happen. On an FPGA, however, the potential memory accesses are statically scheduled and so clock cycles are always allocated for them; wasting cycles when the control flow determines that no computation should take place.
We use OpenCL kernels inspired by [5] to facilitate the simulation of a quantum gate. Figure 3 demonstrates pseudocode for the execution of an iteration. The kernels are built for the FPGA board using Intel’s OpenCL FPGA SDK, as NDRange kernels. The global IDs for the NDRange kernels are the iteration indices and so go from \(0\) to \(2^{n-1}-1\) for \(n\) qubits. The host schedules an NDRange kernel in this way for every gate in the circuit, serially. We use 2 32-bit floating point numbers to represent each complex amplitude in memory.
The primary contribution of this work comes from realising that we can schedule exactly as many iterations which perform memory accesses as are required for the gate, taking the gate’s controls into account. To demonstrate this, we introduce the concept
of an iteration index set in the context of full state vector simulation. As described above, to simulate a quantum gate over an \(n\)-qubit register, \(2^{n-1}\) iterations are required,
giving an iteration index set of \([0, 2^{n-1}-1]\). Define this set as \(I_g\) for global iteration index set. These are the indices which can be plugged into the ithCleared(i,t)
function [5]
along with the target qubit \(t\) to return the index of the first element in the pair of amplitudes that need to be processed for any given iteration \(i \in I_g\).
Our goal is to be able to schedule only the number of iterations that are required taking into account each added control, i.e. introduce a reduced iteration index set \(I_r = [0, 2^{n-n_c-1}-1]\), where \(n_c\) is the number of controls of the gate.
The challenge is to map this reduced set \(I_r\) back to the global set \(I_g\), as directly scheduling \(I_r\) would result in incorrect calculations.
The idea is to iteratively map the smaller iteration sets back to \(I_g\), considering each control qubit, selecting the values of \(i\) from \(I_g\) (which
can be plugged into ithCleared
) that correspond to \(I_r\) taking into account the set of controls applied to the gate \(C = \{c_0, c_1, ...,
c_{n_c-1}\}\).
This means we need a map from the values of \(I_r\) to the values of \(I_g\). For a single control, let \(I_{r_0}\) be the reduced iteration set; this will have half the cardinality of \(I_g\). If we introduce a further control, let the corresponding reduced iteration set be \(I_{r_1}\) (which will have half the cardinality of \(I_{r_0}\)); as long as this further control is higher in the register than the first control, we can map from \(I_{r_1}\) to \(I_{r_0}\), and then finally map from \(I_{r_0}\) to \(I_g\). This is the general idea for handling higher numbers of controls: iteratively map from the smaller iteration sets until the global iteration index in \(I_g\) is reached.
To realise this, we map out the required memory accesses for differently controlled gates to find a pattern. We start from the access pattern demonstrated in Figure 2 and rearrange the amplitudes such that iteration pairs are contiguous (they are not actually rearranged in memory, rather this is just for demonstration). We then impose controls, in an ascending order, for every target qubit example. We make an exhaustive list of controls: for a 4-qubit example, there will be three cases with 1 control qubit, three cases with 2 control qubits, and one case with 3 control qubits. When controls are arranged in a logical order across different target examples, a pattern emerges in the iterations which are skipped.
For brevity, Figure 4 shows the example for a 3-qubit register. Note that the first row in each target example (representing the uncontrolled case) is now shown in binary, to make it easier to recognise the control condition for the controlled cases, and the whole table is rearranged such that iteration pair elements are contiguous. For each controlled case, we cross out the iteration pairs which do not satisfy the controls (where the control qubits are 0 in the binary representation). Regardless of the target qubit, the same iteration skips pattern emerges.
In order to encode these iteration skips, we start by introducing the concept of an adjusted control, which is a re-indexing of the control qubits relative to the target qubit: if the control qubit is greater than the target qubit, subtract one, otherwise keep its original value. This is shown in Eq. 4 . This method gives the same values of adjusted controls for all the control qubit enumerations for different values of the target qubit; e.g. for the 3-qubit register demonstrated in Figure 4, the method gives us adjusted control values of \(c_{adj} = 0\), \(c_{adj} = 1\), and \(c_{adj} = 0,1\) for the three possible control qubit cases. We can then compute a skip interval corresponding to each adjusted control value as \(2^{c_{adj}}\). Based on the computed skip interval, we can map any iteration index belonging to a reduced iteration index set (\(i_{r_{k+1}}\)) to a higher iteration index (\(i_{r_k}\)) by adding to it, following the iterative formula shown in Equation 5 where we can consider \(i_g\), the global iteration index, as \(i_{r_{-1}}\).
\[c_{adj} = \left\{ \begin{array}{ c l } c - 1 & \quad \textrm{if } c > t\\ c & \quad \textrm{otherwise} \end{array} \right. \label{eq:adj95controls95formula}\tag{4}\]
\[i_{r_k} = i_{r_{k+1}} + (\lfloor i_{r_{k+1}} / 2^{c_{adj}} \rfloor + 1) \times 2^{c_{adj}} \label{eq:controls95iterative95formula}\tag{5}\]
To illustrate this, we take the \(t=1\) case for a 3-qubit register as an example (the middle table in Figure 4). For 3 qubits, the global iteration index set is \(I_g = \{0,1,2,3\}\) (as there are \(2^3=8\) amplitudes, the number of iterations is half the size of the state vector). When one control is imposed (\(c_0 = 0\)
or \(c_0 = 2\)), the reduced iteration index set is \(I_{r_0} = \{0,1\}\), and these are the values which the NDRange kernel will be scheduled with (i.e. the values that get_global_id()
will return). For \(c_0 = 0\), the adjusted control remains the same as \(c_0 < t\), so \(c_{0_{adj}} = 0\). The skip interval is then computed as \(2^{c_{0_{adj}}} = 2^0 = 1\).
Then, following Equation 4, \(I_{r_0}\) can be mapped to \(I_g\) in this way: \(\{0,1\} \rightarrow \{0+(0/1+1)\times1, 1+(1/1+1)\times1\} = \{1,3\}\), meaning the second and fourth global iterations are the required ones, which matches the result in the figure. For the case where \(c_0 = 2\), \(c_0 > t\) and so the adjusted control is \(c_{0_{adj}} = c_0 - 1 = 1\), giving a skip interval of \(2^1 = 2\). Following the same formula as the previous case, we can map the reduced iteration index set to the global one in this way: \(\{0,1\} \rightarrow \{0+(0/2+1)\times2, 1+(\lfloor1/2\rfloor+1)\times2\} = \{2,3\}\), meaning the last two global iterations are the desired ones, which again can is verified by the figure. Finally, for the case where we have two controls \(c_0 = 0, c_1 = 2\), we perform two mappings: one from \(I_{r_1} = \{0\}\) (corresponding to the two imposed controls) to \(I_{r_0} = \{0,1\}\) (corresponding to one imposed control) and then to \(I_{r_{-1}} = I_g = \{0,1,2,3\}\), the global iteration index set.
To go from \(I_{r_1}\) to \(I_{r_0}\), we take the first control \(c_0 = 0\) and apply the formula: \(c_{0_{adj}} = c_0 = 0\) as \(c_0 < t\), and the skip interval is again \(2^0 = 1\). The single element, \(i_{r_1} = 0\) in \(I_{r_1}\) is then mapped to \(i_{r_0} = 0+(0/1+1)\times1 = 1\) in \(I_{r_1}\). We then perform the second mapping, for \(c_1 = 2\): \(c_{1_{adj}} = c_1 - 1 = 1\) as \(c_1 > t\), and the skip interval is \(2^1 = 2\). Applying the formula gives \(i_g = 1+(\lfloor1/2\rfloor+1)\times2 = 3\), meaning only the final iteration in the global iteration set is to be performed. The only condition for this formula to function correctly is that the controls have to be in ascending order, i.e. strictly \(c_0 < c_1 < ... < c_{n_c-1}\).
With this formula, we can schedule the compute kernels with exactly the number of iterations that will always update the memory. We use the formula to iteratively go from a reduced iteration index to the equivalent global iteration index allowing us to
use the previously used memory indexing strategy based on the ithCleared
function. Figure 5 shows the implementation of the formula in the OpenCL kernel, for up to 2 controls.
This block would go above the pair element index computation in the pseudocode for the baseline kernel shown above in Figure 3, and then \(i_g\) would be used in the computation
of pairElement1
in place of \(i\) in the ithCleared
function. The controls check block that utilises a boolean flag perform
would no longer be needed and there would be no control flow checks on the memory access.
We evaluate our optimised iteration scheduling by building two target architectures: a baseline based on the kernel pseudocode in Figure 3 which always schedules \(2^{n-1}\) iterations, and an optimised iterations architecture scheduling \(2^{n-n_c-1}\) iterations, where \(n_c\) is the number of controls imposed on the gate being scheduled.
We utilised a system with a dual Intel Xeon E5-2609 V2 2.5 GHz processor and 64 GB DDR3 1.6 GHz RAM running Scientific Linux 6.8. Our system hosts a single Nallatech PCIe-385N A7 FPGA board with 8 GB DDR3 RAM connected through a PCIe 2 connection and we used the Intel SDK for OpenCL version 17.1 to compile for and program the board. The host code is compiled using G++ (GCC version 4.7.2). We use the same FPGA host system to evaluate the CPU and we run the same OpenCL kernel with the same host code. To evaluate on the GPU, we have access to a different host with an NVIDIA GK110B (GeForce GTX TITAN Black), with access to 6GB of GDDR5 VRAM. The GPU system compiles the C++ host code with G++ (GCC version 5.4.0).
In order to be competitive with distributed computing simulation methods, it will be necessary to move to clusters of FPGAs. In the case of supercomputing clusters, energy consumed becomes an important metric, and so we choose to focus on the energy consumed by a device to simulate an evaluation circuit. Our FPGA’s power consumption is rated at 25W [16], while our CPU consumes 160W and the GPU consumes 250W. To compute the energy consumed during the execution of a circuit, we multiply the total time of the circuit by the target platform’s power consumption.
Kernel | Baseline | Optimised iterations | Total Available |
---|---|---|---|
ALUTs | \(59772 (17\%)\) | \(60029 (17\%)\) | \(345200\) |
FFs | \(71743 (10\%)\) | \(71941 (10\%)\) | \(690400\) |
RAMs | \(408 (20\%)\) | \(408 (20\%)\) | \(2014\) |
DSPs | \(16 (1\%)\) | \(16 (1\%)\) | \(1590\) |
\(F_{max}\) (MHz) | \(296.02\) | \(308.35\) | \(-\) |
For the purpose of evaluating the optimised iteration scheduling technique, we choose to synthesise both target architectures with one compute unit on the FPGA. As such we also compare against the CPU and GPU running with one OpenCL compute unit. Table 1 shows the resource utilisation on the FPGA for both architectures with a maximum of 2 controls per gate. We observe that both architectures use roughly the same amount of resources and the optimised iterations architecture is able to run at a slightly higher maximum frequency.
We choose three algorithms as evaluation targets: the Quantum Fourier Transform (QFT), which forms an important part of several other quantum algorithms, an integer squaring circuit, and a streaming circuit.
The Quantum Fourier Transform is a quantum algorithm that generalises the classical Fourier Transform to the quantum realm. It is a key component in several quantum algorithms, including Shor’s algorithm [17], QFT-based arithmetic algorithms [18], and quantum phase estimation algorithms. The QFT maps an input quantum state to its Fourier transformed state in polynomial quantum time, allowing for the efficient manipulation of quantum information. The QFT circuit has a staircase pattern of Hadamard and controlled phase rotation gates. An example 5-qubit QFT circuit is shown in Figure 6; generalised n-qubit QFT circuits can be constructed following the same pattern.
Figure 7 shows the total energy consumption for running QFT circuits of various circuit widths across the three evaluation platforms and both architectures.
Platform and Kernel | Total time (s) | Total energy (J) |
---|---|---|
Baseline FPGA | 413.83 | 10345.73 |
Optimised FPGA | 251.00 | 6275.05 |
Baseline CPU | 90.88 | 14540.74 |
Optimised CPU | 160.93 | 25748.23 |
Baseline GPU | 9.58 | 2395.93 |
Optimised GPU | 9.22 | 2306.12 |
Platform and Kernel | Total time (s) | Total energy (J) |
---|---|---|
Baseline FPGA | 909.23 | 22730.74 |
Optimised FPGA | 501.32 | 12533.00 |
Baseline CPU | 174.71 | 27953.62 |
Optimised CPU | 148.97 | 23834.92 |
Baseline GPU | 16.52 | 4129.95 |
Optimised GPU | 13.26 | 3314.17 |
Platform and Kernel | Total time (s) | Total energy (J) |
---|---|---|
Baseline FPGA | 26.07 | 651.66 |
Optimised FPGA | 3.78 | 94.44 |
Baseline CPU | 10.36 | 1656.83 |
Optimised CPU | 2.09 | 334.01 |
Baseline GPU | 1.73 | 434.46 |
Optimised GPU | 0.95 | 238.30 |
In [15], the authors demonstrate squaring circuits that use controlled Cuccaro adders [19] and shift operators. We use these circuits as our second evaluation target. Cuccaro adders use quantum gates with up to two controls, and so our squaring circuits use gates with up to three controls, as they use controlled Cuccaro adders. Figure 8 shows the energy consumption of squaring circuits with different input bit sizes. An \(n\)-bit input squaring circuit uses \(3\times n + 2\) qubits, and so our highest qubit example (the 9-bit input) uses 29 qubits.
Our final evaluation target is a streaming circuit useful in quantum circuit implementations of the Lattice Boltzmann model, which was used in modelling quantum walks on a one- and two-dimensional lattices [20]. An example 4-qubit streaming circuit is shown in 9. We can construct and run similar circuits with up to 29-qubits. The reason these circuits are chosen is because an \(n\)-qubit streaming circuit contains gates with up to \(n-1\) controls, making it a good "best-case" test circuit for our optimisation. Figure 10 shows the energy consumed for different streaming circuits.
As expected, circuits where more gates have high number of control qubits benefit most from our optimisation. Tables 2, 3, and 4 present summaries of the timing and energy consumed for the largest circuits (all 29 qubits) for each of the evaluation algorithms. For the QFT (which has at most one control on a gate), we see almost a \(2\times\) improvement for the FPGA, while the CPU performance worsens (a result of spending extra cycles computing the optimised indexing formula), and the GPU performance is mostly unaffected. For the squaring circuit (which has up to 2 more controls compared to the QFT), we see that all platforms benefit from utilising the optimisation, with the FPGA again benefiting the most at almost \(2\times\) improvement in time and energy consumed. The streaming circuit, which has increasingly higher number of control qubits across its gates, demonstrates the most benefit for the FPGA simulation, at almost a \(7\times\) improvement over the baseline. This testcase shows that it is possible for the FPGA simulation to outperform the CPU and the GPU in energy efficiency, even though our FPGA is overdimensioned for the circuit.
In addition, based on resource utilisation demonstrated in Table 1, we could run the architecture on up to 4\(\times\) smaller FPGA, e.g. a Stratix V D3, which would result in better energy efficiency than the GPU-based simulation for all test cases.
In this work, we presented a technique for optimising the number of iterations that need to be scheduled for the execution of a quantum gate, ensuring that every iteration does useful work. We have demonstrated the specific advantage of our technique for FPGA-based simulation.
In the future, we intend to extend our FPGA architecture with support for multiple compute units to maximise resource utilisation. We aim to introduce further optimisations including gate fusion, fixed-point precision, and compute tables.