
Implement CPU Fused Multi-SWAP with Lightweight Pipeline Buffer Reuse (contributes to #595) #794

Open
zhangchuan92910 wants to merge 1 commit into QuEST-Kit:devel from zhangchuan92910:devel

Close #595

Overview

This PR implements Fused Multi-SWAP (SWAP fusion) for the distributed CPU backend, directly addressing the communication bottleneck outlined in #595. Instead of executing disjoint SWAPs sequentially (which wastefully forces amplitudes to cross the network multiple times), this implementation determines the final destination of each amplitude and sends it exactly once.
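To make the "send each amplitude exactly once" idea concrete, here is a minimal Python sketch (not the code in this PR). It assumes a hypothetical layout in which the low `num_local` bits of a global amplitude index are the local index and the remaining high bits are the rank; the function names are illustrative only. It counts how many amplitudes cross a rank boundary when disjoint SWAPs are applied one by one versus when each amplitude is moved straight to its final destination.

```python
def swap_bits(i, a, b):
    """Return index i with bits a and b exchanged."""
    if ((i >> a) & 1) != ((i >> b) & 1):
        i ^= (1 << a) | (1 << b)
    return i


def fused_destination(i, pairs):
    """Final index of amplitude i after all disjoint SWAPs in `pairs`."""
    for a, b in pairs:
        i = swap_bits(i, a, b)
    return i


def count_network_moves(num_qubits, num_local, pairs):
    """Return (sequential, fused) counts of amplitudes that change rank.

    Assumption: rank = index >> num_local (hypothetical layout).
    """
    def rank(i):
        return i >> num_local

    sequential = 0
    fused = 0
    for i in range(1 << num_qubits):
        pos = i
        # Sequential: every SWAP that changes the rank costs one transfer.
        for a, b in pairs:
            nxt = swap_bits(pos, a, b)
            sequential += rank(nxt) != rank(pos)
            pos = nxt
        # Fused: at most one transfer, straight to the final rank.
        fused += rank(pos) != rank(i)
    return sequential, fused


# 6 qubits, 3 local qubits (8 ranks), each local qubit swapped with a rank qubit.
print(count_network_moves(6, 3, [(0, 3), (1, 4), (2, 5)]))
```

In this toy case the sequential path performs 96 cross-rank amplitude moves and the fused path 56, which is the same 12:7 ratio discussed under Verification & Benchmarks.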

Algorithmic Approach & Acknowledgements

First, I would like to acknowledge the excellent work in PR #785 and PR #790. The core communication topology used in this PR shares the same mathematical foundation, namely the $2^k-1$ subcube XOR partner exchanges, which is a natural algorithmic choice for hypercube topologies.
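As a small illustration of the partner set, the sketch below enumerates the $2^k-1$ XOR partners of a rank. It is a hedged example rather than code from any of the PRs: `xor_partners` and `rank_bits` are hypothetical names, and `rank_bits` is assumed to hold the bit positions (within the rank number) of the $k$ distributed qubits being swapped.

```python
def xor_partners(rank, rank_bits):
    """Ranks reached by flipping every nonempty subset of `rank_bits`.

    Returns 2^k - 1 partners for k = len(rank_bits), ordered by subset index.
    """
    k = len(rank_bits)
    partners = []
    for subset in range(1, 1 << k):
        mask = 0
        for j, bit in enumerate(rank_bits):
            if (subset >> j) & 1:
                mask |= 1 << bit
        partners.append(rank ^ mask)
    return partners


# Rank 5 of 8, with all three rank bits involved: every other rank is a partner.
print(xor_partners(5, [0, 1, 2]))
```

Because XOR is its own inverse, the relation is symmetric: if rank `p` is the partner of `r` for a given mask, then `r` is the partner of `p` for the same mask, so each step is a pairwise exchange.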

However, this PR takes a different approach to engineering and memory management, offered as an alternative solution:

1. Lightweight Pipeline Buffer Reuse (Zero Dynamic Allocation)

Instead of posting all $2^k-1$ MPI_Isend/Irecv requests at once (which requires allocating and managing a large, dynamic staging cache such as fusedSwapSendCache to hold all payloads concurrently), this PR uses a sequential communication pipeline.
Because we process one XOR partner at a time, each step needs a buffer of only $O(N/2^k)$ amplitudes. We can therefore split the pre-existing qureg.cpuCommBuffer into send and recv halves.

  • Benefit: No dynamic memory allocation (malloc or std::vector), no global state cache to maintain, and a much smaller memory footprint, which avoids adding OOM risk in memory-constrained HPC environments.
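The pipeline described above can be sketched without MPI as a single-process simulation. This is a minimal illustration under stated assumptions, not the PR's implementation: each rank is a Python list, the communication buffer is assumed to be as large as the local amplitude array and is split into a send half and a recv half, and `fused_swap_pipeline` is a hypothetical name. Each entry of `pairs` is `(local_qubit, rank_bit)`.

```python
def fused_swap_pipeline(ranks, num_local, pairs):
    """Apply k disjoint local<->rank-bit SWAPs in place, one XOR partner per step."""
    k = len(pairs)
    size = 1 << num_local
    chunk = size >> k          # amplitudes sent to each partner
    half = size // 2           # assumed split: [send half | recv half]
    assert chunk <= half
    buffers = [[None] * size for _ in ranks]  # allocated once, reused every step

    for subset in range(1, 1 << k):
        mask = sum(1 << g for j, (_, g) in enumerate(pairs) if (subset >> j) & 1)

        # Pack: local indices whose target bits differ from this rank's
        # bits exactly on the qubits selected by `subset`.
        slots = []
        for r, amps in enumerate(ranks):
            want = [((r >> g) & 1) ^ ((subset >> j) & 1)
                    for j, (_, g) in enumerate(pairs)]
            idx = [l for l in range(size)
                   if all(((l >> t) & 1) == w for (t, _), w in zip(pairs, want))]
            slots.append(idx)
            buffers[r][:chunk] = [amps[l] for l in idx]

        # Exchange: stands in for one Isend/Irecv pair plus a wait.
        for r in range(len(ranks)):
            partner = r ^ mask
            buffers[r][half:half + chunk] = buffers[partner][:chunk]

        # Unpack: received amplitudes fill exactly the slots just vacated.
        for r, amps in enumerate(ranks):
            for n, l in enumerate(slots[r]):
                amps[l] = buffers[r][half + n]


# 8 ranks, 3 local qubits; amplitude value = its original global index.
state = [[(r << 3) | l for l in range(8)] for r in range(8)]
fused_swap_pipeline(state, 3, [(0, 0), (1, 1), (2, 2)])
print(state[0])
```

The point of the sketch is that the slots a rank sends to a given partner are the same slots it fills from that partner, so the update is safe in place and the two buffer halves can be reused for every step.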

2. Trade-offs in MPI Wait Latency

We acknowledge that our pipeline approach trades away some MPI latency hiding. PR #790's approach of posting all $2^k-1$ exchanges concurrently and completing them with a single MPI_Waitall helps hide network latency on large clusters. Our sequential pipeline instead invokes MPI_Waitall once per partner. However, because the number of rounds $2^k-1$ is generally small (at most 15 for the typical $k \le 4$), we believe that trading a few extra rounds of network latency for $O(1)$ additional memory and zero allocation overhead is a practical and robust choice for the CPU backend.

3. Native Compile-Time Unrolling

The buffer packing and unpacking kernels are integrated with QuEST's native SET_VAR_AT_COMPILE_TIME macro. The inner packing loops are fully unrolled at compile time for $k \le 5$, so the compiler can emit branchless, auto-vectorized SIMD code.

4. Safety Fallback for GPUs

While this PR focuses strictly on the CPU distributed logic (GPU acceleration is currently out of scope for this specific PR), it safely handles GPU-accelerated Quregs by injecting syncQuregFromGpu() and syncQuregToGpu() to ensure data consistency without breaking execution.

Verification & Benchmarks

  • Correctness: Fully tested and verified to yield bit-for-bit identical statevectors compared to the sequential multi-SWAP implementation.
  • Performance Disclaimer: In my local simulation environments (running multiple MPI ranks on a single machine), the overall speedup was heavily bottlenecked by local memory bandwidth, showing no significant wall-clock time improvements.
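  • Theoretical Network Reduction: Counting only network traffic, for an $N=2^{26}$ state across 8 nodes performing $k=3$ multi-SWAPs, the sequential path performs 3 full pairwise exchanges, each moving half of a node's amplitudes ($3 \times 1/2 = 1.5$ local statevectors in total), while the fused path performs 7 fractional exchanges of one eighth each ($7 \times 1/8 = 0.875$). This is a ~41.7% reduction in total bytes transmitted. On a truly distributed HPC cluster, where the network rather than local memory bandwidth is the bottleneck, this reduced traffic is expected to translate into wall-clock speedups.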
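The traffic estimate can be reproduced with a few lines of exact arithmetic. The sketch below is illustrative only and rests on two assumptions: each sequential SWAP pairs one local qubit with one distributed qubit and therefore exchanges half of a node's amplitudes, and each amplitude occupies 16 bytes (double-precision complex). The function names are hypothetical.

```python
from fractions import Fraction


def traffic_fractions(k):
    """Per-node traffic in units of one local statevector.

    Returns (sequential, fused, relative_reduction) for k disjoint SWAPs.
    """
    sequential = Fraction(k, 2)               # k exchanges of 1/2 each
    fused = Fraction(2**k - 1, 2**k)          # 2^k - 1 exchanges of 1/2^k each
    return sequential, fused, 1 - fused / sequential


def bytes_sent_per_node(total_amps, num_nodes, k, bytes_per_amp=16):
    """Bytes each node sends under both schemes (bytes_per_amp is an assumption)."""
    local_bytes = (total_amps // num_nodes) * bytes_per_amp
    sequential, fused, _ = traffic_fractions(k)
    return int(sequential * local_bytes), int(fused * local_bytes)


seq, fused, reduction = traffic_fractions(3)
print(seq, fused, float(reduction))
print(bytes_sent_per_node(2**26, 8, 3))
```

For $k=3$ the relative reduction is exactly $5/12$. The saving grows with $k$ under these assumptions, since the fused traffic stays below one local statevector while the sequential traffic grows linearly.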

AI disclosure

This code was developed with the assistance of Google Gemini for algorithmic refactoring, performance analysis, and optimization prototyping. I provided pseudocode, have thoroughly reviewed the generated code, and have verified that the underlying communication topology and implementation details align strictly with the approaches discussed in the related scientific literature on SWAP fusion.
