Implement SWAP fusion

> [!NOTE]
> For all development, please work out of the [`devel`](https://github.com/QuEST-Kit/QuEST/tree/devel) branch

## Context

> It may be worth a quick, prior read about QuEST's [software architecture](https://github.com/QuEST-Kit/QuEST/blob/main/docs/architecture.md).

### SWAP

The SWAP gate is ubiquitous in quantum computing and effects a change upon the quantum state as if the targeted qubits were physically swapped or relabelled. 


<img width="60" height="72" alt="Image" src="https://github.com/user-attachments/assets/2fc7c6e0-9b58-458a-87e6-97712c5bbea8" />


Let $i_{[t]}$ notate the $t$-th bit of an unsigned integer $i$, and let $i_{\neg \{a,b\}}$ notate the unsigned integer resulting from flipping two bits of $i$; those at indices $a$ and $b$. As outlined [here](https://arxiv.org/abs/2311.01512) (Sec IV C), the SWAP gate targeting qubits $t_1$ and $t_2$ modifies the $i$-th basis state $\ket{i}$ as

$$ SWAP_{t_1,t_2} \ket{i} = \begin{cases} \ket{i} \text{ when } i_{[t_1]} = i_{[t_2]}, \text{ else } \ket{i_{\neg \{t_1,t_2\}}} \end{cases}$$

When $SWAP_{t_1,t_2}$ operates upon a statevector $\ket{\psi} = \sum\limits_i \alpha_i \ket{i}$, only amplitudes $\alpha_i$ which satisfy $i_{[t_1]} \ne i_{[t_2]}$ are changed; they are swapped with a pair amplitude at index $i\neg\{t_1,t_2\}$.

$$
\alpha_i \rightarrow \alpha_{i\neg\{t_1,t_2\}}
$$

QuEST's code to simulate a SWAP gate in non-distributed settings resembles:

https://github.com/QuEST-Kit/QuEST/blob/53f8f3ad60e5b0171646ee8250c0e6ea65878b44/quest/src/cpu/cpu_subroutines.cpp#L258-L282

Read how to simulate this operation in distributed settings [here](https://arxiv.org/abs/2311.01512) (Sec IV C).

### multi-SWAP

Many quantum algorithms feature circuits containing a contiguous sequence of non-overlapping SWAP gates.


<img width="120" height="207" alt="Image" src="https://github.com/user-attachments/assets/93914a2e-8af6-40be-a099-2a902f1174fa" />


Some unrelated QuEST routines even use a sequence of SWAPs to reorder qubits, in order to change the communication pattern during distributed simulation.


<img width="300" alt="Image" src="https://github.com/user-attachments/assets/5185ee3e-33c2-4867-b04b-bd8f2fc0e9e7" />


This is called "_cache blocking_" and is used by distributed simulators like [cuQuantum](https://docs.nvidia.com/cuda/cuquantum/latest/custatevec/distributed-index-bit-swap.html), Qiskit, [mpiQulacs](https://arxiv.org/pdf/2203.16044), often as the primary means of distribution. In QuEST, this technique is necessary when effecting arbitrary matrices targeting **two** or **more** qubits, like via `applyCompMatr2`. See **Sec. IV D** (pg 21) [here](https://arxiv.org/pdf/2311.01512) for an explanation.

It is prudent to simulate a multi-SWAP (i.e. a sequence of contiguous SWAP gates) as fast as possible, and to reduce its communication costs when simulated in distributed settings.


## Problem

Currently, when QuEST needs to perform a multi-SWAP, it simply performs each individual SWAP in-turn:

https://github.com/QuEST-Kit/QuEST/blob/53f8f3ad60e5b0171646ee8250c0e6ea65878b44/quest/src/core/localiser.cpp#L906-L932

This is _inefficient_! Each successive SWAP is not _modifying_ any amplitudes, but merely moving them. It should be much faster to move amplitude each straight to their final locations, performing all swaps "at once". This is called "_SWAP fusion_", first reported [here](https://arxiv.org/abs/quant-ph/0608239), and requires more complicated communication logic than the single-SWAP case. SWAP fusion is implemented by e.g. cuQuantum and mpiQulacs.

## Solution

Optimising multi-SWAP via implementing SWAP fusion would greatly benefit QuEST's distributed simulation of many-target matrix gates and Kraus maps (which use the aforementioned reordering trick). Implementing SWAP fusion will involve replacing the one-by-one swaps in [`localiser.cpp`](https://github.com/QuEST-Kit/QuEST/blob/main/quest/src/core/localiser.cpp)'s [`anyCtrlMultiSwapBetweenPrefixAndSuffix`](https://github.com/QuEST-Kit/QuEST/blob/53f8f3ad60e5b0171646ee8250c0e6ea65878b44/quest/src/core/localiser.cpp#L906) with a new, custom, distributed routine in [`comm_routines.cpp`](https://github.com/QuEST-Kit/QuEST/blob/main/quest/src/comm/comm_routines.cpp). It may also require locally-parallelised (OpenMP and CUDA) routines to pack communication buffers, implemented in [`cpu_subroutines.cpp`](https://github.com/QuEST-Kit/QuEST/blob/53f8f3ad60e5b0171646ee8250c0e6ea65878b44/quest/src/cpu/cpu_subroutines.cpp) and [`gpu_subroutines.cpp`](https://github.com/QuEST-Kit/QuEST/blob/53f8f3ad60e5b0171646ee8250c0e6ea65878b44/quest/src/gpu/gpu_subroutines.cpp).

> An external contribution need not attempt implementing the GPU logic; the OpenMP logic alone is fine.

Implementing this will also require benchmarking to confirm the fused SWAPs are faster, including when each distributed node is _not_ parallelised (with multithreading or GPU-acceleration), to ensure the communication benefits are not outweighed by additional local processing/branching (which is of course very unlikely).

	template <int NumCtrls>
	void cpu_statevec_anyCtrlSwap_subA(Qureg qureg, vector<int> ctrls, vector<int> ctrlStates, int targ1, int targ2) {

	assert_numCtrlsMatchesNumCtrlStatesAndTemplateParam(ctrls.size(), ctrlStates.size(), NumCtrls);

	// each control qubit halves the number of iterations, each of which modifies 2 amplitudes, and skips 2
	qindex numIts = qureg.numAmpsPerNode / powerOf2(2 + ctrls.size());

	auto sortedQubits = util_getSorted(ctrls, {targ2, targ1});
	auto qubitStateMask = util_getBitMask(ctrls, ctrlStates, {targ2, targ1}, {0, 1});

	// use template param to compile-time unroll loop in insertBits()
	SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrls.size());
	int numQubitBits = numCtrlBits + 2;

	#pragma omp parallel for if(qureg.isMultithreaded)
	for (qindex n=0; n<numIts; n++) {

	// i01 = nth local index where ctrls are active, targ2=0 and targ1=1
	qindex i01 = insertBitsWithMaskedValues(n, sortedQubits.data(), numQubitBits, qubitStateMask);
	qindex i10 = flipTwoBits(i01, targ2, targ1);

	std::swap(qureg.cpuAmps[i01], qureg.cpuAmps[i10]);
	}
	}

	void anyCtrlMultiSwapBetweenPrefixAndSuffix(Qureg qureg, vector<int> ctrls, vector<int> ctrlStates, vector<int> targsA, vector<int> targsB) {

	// this is an internal function called by the below routines which require
	// performing a sequence of SWAPs to reorder qubits, or move them into suffix.
	// the SWAPs act on unique qubit pairs and so commute.

	/// @todo
	/// - the sequence of pair-wise full-swaps should be more efficient as a
	/// "single" sequence of smaller messages sending amps directly to their
	/// final destination node. This could use a new "multiSwap" function.
	/// - if the user has compiled cuQuantum, and Qureg is GPU-accelerated, the
	/// multiSwap function should use custatevecSwapIndexBits() if local,
	/// or custatevecDistIndexBitSwapSchedulerSetIndexBitSwaps() if distributed,
	/// although the latter requires substantially more work like setting up
	/// a communicator which may be inelegant alongside our own distribution scheme.

	// perform necessary swaps to move all targets into suffix, each of which invokes communication
	for (size_t i=0; i<targsA.size(); i++) {

	if (targsA[i] == targsB[i])
	continue;

	int suffixTarg = std::min(targsA[i], targsB[i]);
	int prefixTarg = std::max(targsA[i], targsB[i]);
	anyCtrlSwapBetweenPrefixAndSuffix(qureg, ctrls, ctrlStates, suffixTarg, prefixTarg);
	}
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement SWAP fusion #595

Context

SWAP

multi-SWAP

Problem

Solution

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Implement SWAP fusion #595

Description

Context

SWAP

multi-SWAP

Problem

Solution

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions