You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The SWAP gate is ubiquitous in quantum computing and effects a change upon the quantum state as if the targeted qubits were physically swapped or relabelled.
Let $i_{[t]}$ notate the $t$-th bit of an unsigned integer $i$, and let $i_{\neg {a,b}}$ notate the unsigned integer resulting from flipping two bits of $i$; those at indices $a$ and $b$. As outlined here (Sec IV C), the SWAP gate targeting qubits $t_1$ and $t_2$ modifies the $i$-th basis state $\ket{i}$ as
When $SWAP_{t_1,t_2}$ operates upon a statevector $\ket{\psi} = \sum\limits_i \alpha_i \ket{i}$, only amplitudes $\alpha_i$ which satisfy $i_{[t_1]} \ne i_{[t_2]}$ are changed; they are swapped with a pair amplitude at index $i\neg{t_1,t_2}$.
Read how to simulate this operation in distributed settings here (Sec IV C).
multi-SWAP
Many quantum algorithms feature circuits containing a contiguous sequence of non-overlapping SWAP gates.
Some unrelated QuEST routines even use a sequence of SWAPs to reorder qubits, in order to change the communication pattern during distributed simulation.
This is called "cache blocking" and is used by distributed simulators like cuQuantum, Qiskit, mpiQulacs, often as the primary means of distribution. In QuEST, this technique is necessary when effecting arbitrary matrices targeting two or more qubits, like via applyCompMatr2. See Sec. IV D (pg 21) here for an explanation.
It is prudent to simulate a multi-SWAP (i.e. a sequence of contiguous SWAP gates) as fast as possible, and to reduce its communication costs when simulated in distributed settings.
Problem
Currently, when QuEST needs to perform a multi-SWAP, it simply performs each individual SWAP in-turn:
This is inefficient! Each successive SWAP is not modifying any amplitudes, but merely moving them. It should be much faster to move amplitude each straight to their final locations, performing all swaps "at once". This is called "SWAP fusion", first reported here, and requires more complicated communication logic than the single-SWAP case. SWAP fusion is implemented by e.g. cuQuantum and mpiQulacs.
Solution
Optimising multi-SWAP via implementing SWAP fusion would greatly benefit QuEST's distributed simulation of many-target matrix gates and Kraus maps (which use the aforementioned reordering trick). Implementing SWAP fusion will involve replacing the one-by-one swaps in localiser.cpp's anyCtrlMultiSwapBetweenPrefixAndSuffix with a new, custom, distributed routine in comm_routines.cpp. It may also require locally-parallelised (OpenMP and CUDA) routines to pack communication buffers, implemented in cpu_subroutines.cpp and gpu_subroutines.cpp.
An external contribution need not attempt implementing the GPU logic; the OpenMP logic alone is fine.
Implementing this will also require benchmarking to confirm the fused SWAPs are faster, including when each distributed node is not parallelised (with multithreading or GPU-acceleration), to ensure the communication benefits are not outweighed by additional local processing/branching (which is of course very unlikely).
Note
For all development, please work out of the
develbranchContext
SWAP
The SWAP gate is ubiquitous in quantum computing and effects a change upon the quantum state as if the targeted qubits were physically swapped or relabelled.
Let$i_{[t]}$ notate the $t$ -th bit of an unsigned integer $i$ , and let $i_{\neg {a,b}}$ notate the unsigned integer resulting from flipping two bits of $i$ ; those at indices $a$ and $b$ . As outlined here (Sec IV C), the SWAP gate targeting qubits $t_1$ and $t_2$ modifies the $i$ -th basis state $\ket{i}$ as
When$SWAP_{t_1,t_2}$ operates upon a statevector $\ket{\psi} = \sum\limits_i \alpha_i \ket{i}$ , only amplitudes $\alpha_i$ which satisfy $i_{[t_1]} \ne i_{[t_2]}$ are changed; they are swapped with a pair amplitude at index $i\neg{t_1,t_2}$ .
QuEST's code to simulate a SWAP gate in non-distributed settings resembles:
QuEST/quest/src/cpu/cpu_subroutines.cpp
Lines 258 to 282 in 53f8f3a
Read how to simulate this operation in distributed settings here (Sec IV C).
multi-SWAP
Many quantum algorithms feature circuits containing a contiguous sequence of non-overlapping SWAP gates.
Some unrelated QuEST routines even use a sequence of SWAPs to reorder qubits, in order to change the communication pattern during distributed simulation.
This is called "cache blocking" and is used by distributed simulators like cuQuantum, Qiskit, mpiQulacs, often as the primary means of distribution. In QuEST, this technique is necessary when effecting arbitrary matrices targeting two or more qubits, like via
applyCompMatr2. See Sec. IV D (pg 21) here for an explanation.It is prudent to simulate a multi-SWAP (i.e. a sequence of contiguous SWAP gates) as fast as possible, and to reduce its communication costs when simulated in distributed settings.
Problem
Currently, when QuEST needs to perform a multi-SWAP, it simply performs each individual SWAP in-turn:
QuEST/quest/src/core/localiser.cpp
Lines 906 to 932 in 53f8f3a
This is inefficient! Each successive SWAP is not modifying any amplitudes, but merely moving them. It should be much faster to move amplitude each straight to their final locations, performing all swaps "at once". This is called "SWAP fusion", first reported here, and requires more complicated communication logic than the single-SWAP case. SWAP fusion is implemented by e.g. cuQuantum and mpiQulacs.
Solution
Optimising multi-SWAP via implementing SWAP fusion would greatly benefit QuEST's distributed simulation of many-target matrix gates and Kraus maps (which use the aforementioned reordering trick). Implementing SWAP fusion will involve replacing the one-by-one swaps in
localiser.cpp'sanyCtrlMultiSwapBetweenPrefixAndSuffixwith a new, custom, distributed routine incomm_routines.cpp. It may also require locally-parallelised (OpenMP and CUDA) routines to pack communication buffers, implemented incpu_subroutines.cppandgpu_subroutines.cpp.Implementing this will also require benchmarking to confirm the fused SWAPs are faster, including when each distributed node is not parallelised (with multithreading or GPU-acceleration), to ensure the communication benefits are not outweighed by additional local processing/branching (which is of course very unlikely).