Implement CPU Fused Multi-SWAP with Lightweight Pipeline Buffer Reuse (contributes to #595) #794
Open
zhangchuan92910 wants to merge 1 commit into
Close #595
Overview
This PR implements Fused Multi-SWAP (SWAP fusion) for the distributed CPU backend, directly addressing the communication bottleneck outlined in #595. Instead of executing disjoint SWAPs sequentially (which wastefully forces amplitudes to cross the network multiple times), this implementation determines the final destination of each amplitude and sends it exactly once.
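The "determine the final destination once" idea can be illustrated with a minimal sketch. It assumes the usual state-vector layout in which a SWAP of qubits `a` and `b` exchanges bits `a` and `b` of the amplitude index, and in which the bits above `numLocalQubits` select the owning rank. The names `swapBits`, `fusedDestination` and `destinationRank` are hypothetical and are not taken from this PR's code.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Index = std::uint64_t;
using SwapList = std::vector<std::pair<int, int>>;

// Exchange bits a and b of an amplitude index (the index-level effect of one SWAP).
inline Index swapBits(Index i, int a, int b) {
    assert(a != b);  // a SWAP acts on two distinct qubits
    Index bitA = (i >> a) & 1ULL;
    Index bitB = (i >> b) & 1ULL;
    if (bitA != bitB) {
        // The bits differ, so flipping both of them exchanges their values.
        i ^= (Index{1} << a) | (Index{1} << b);
    }
    return i;
}

// Compose all SWAPs into a single index permutation. Because the SWAPs are
// disjoint, the order in which they are applied does not matter.
inline Index fusedDestination(Index i, const SwapList& swaps) {
    for (const auto& s : swaps) {
        i = swapBits(i, s.first, s.second);
    }
    return i;
}

// Rank that owns the amplitude after all SWAPs (assumed layout: high bits = rank).
inline Index destinationRank(Index globalIndex, const SwapList& swaps, int numLocalQubits) {
    return fusedDestination(globalIndex, swaps) >> numLocalQubits;
}
```

With two local qubits and the SWAPs (0,2) and (1,3), global index 3 maps directly to index 12 on rank 3, rather than moving to an intermediate rank after the first SWAP and again after the second.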
Algorithmic Approach & Acknowledgements
First, I would like to acknowledge the excellent work in PR #785 and PR #790. The core communication topology used in this PR shares the same mathematical foundation: the $2^k-1$ subcube XOR partner exchanges, which are a brilliant algorithmic choice for hypercube topologies.
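As a rough illustration of the $2^k-1$ XOR partner set, the sketch below enumerates the partners of one rank. It assumes that the $k$ non-local qubits touched by the fused SWAPs correspond to $k$ bits of the rank index, collected in a bitmask, and that each partner is the rank XORed with a nonzero sub-mask of that bitmask. The names `partnerCount` and `xorPartners` are hypothetical, and the `std::vector` is used only to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of exchange partners for a subcube spanned by k rank bits: 2^k - 1.
inline std::uint64_t partnerCount(int k) {
    assert(k >= 0 && k < 64);
    return (std::uint64_t{1} << k) - 1;
}

// Partner ranks of `rank` inside the subcube spanned by the bits set in `subcubeMask`.
// Each partner is rank XOR offset, where offset runs over every nonzero sub-mask.
inline std::vector<std::uint64_t> xorPartners(std::uint64_t rank, std::uint64_t subcubeMask) {
    std::vector<std::uint64_t> partners;
    // Standard sub-mask enumeration: visits each nonzero subset of the mask once,
    // from the full mask downwards.
    for (std::uint64_t offset = subcubeMask; offset != 0; offset = (offset - 1) & subcubeMask) {
        partners.push_back(rank ^ offset);
    }
    return partners;
}
```

The pairing is symmetric: if rank `p` is reached from rank `r` with a given offset, then `r` is reached from `p` with the same offset, so both sides of an exchange agree on the step in which they communicate.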
However, this PR introduces a distinct engineering and memory-management philosophy that provides an alternative, highly optimized solution:
1. Lightweight Pipeline Buffer Reuse (Zero Dynamic Allocation)
Instead of batching all $2^k-1$ `MPI_Isend/Irecv` requests at once (which inherently requires allocating and managing a large, dynamic staging cache like `fusedSwapSendCache` to hold all payloads concurrently), this PR uses a sequential communication pipeline.
Because we process one XOR partner at a time, we only need a strict $O(N/2^k)$ buffer size per step. We can simply split the pre-existing `qureg.cpuCommBuffer` into `send` and `recv` halves. This means zero dynamic allocation (no `malloc` or `std::vector`), no global state cache to maintain, and a drastically reduced memory footprint, completely eliminating OOM risks on memory-constrained HPC environments.
2. Trade-offs in MPI Wait Latency
We acknowledge the trade-off in our pipeline approach regarding MPI latency. PR #790's approach of throwing all $2^k-1$ target exchanges to MPI concurrently and doing a single `MPI_Waitall` helps to hide network latency in massive clusters. Our sequential pipeline invokes `MPI_Waitall` once per target iteration. However, because the number of rounds $2^k-1$ is generally very small (mostly $k \le 4$), we believe trading a few extra rounds of network latency for absolute $O(1)$ memory safety and zero memory allocation overhead is a highly practical and robust choice for the CPU backend.
3. Native Compile-Time Unrolling
The buffer packing and unpacking kernels are tightly integrated with QuEST's native `SET_VAR_AT_COMPILE_TIME` macro. The inner packing loops are fully unrolled at compile time for $k \le 5$, ensuring the CPU benefits from branchless SIMD auto-vectorization.
4. Safety Fallback for GPUs
While this PR focuses strictly on the CPU distributed logic (GPU acceleration is currently out of scope for this specific PR), it safely handles GPU-accelerated Quregs by injecting `syncQuregFromGpu()` and `syncQuregToGpu()` to ensure data consistency without breaking execution.
Verification & Benchmarks
AI disclosure
This code was developed with the assistance of Google Gemini for algorithmic refactoring, performance analysis, and optimization prototyping. I provided pseudocode, have thoroughly reviewed the generated code, and have verified that the underlying communication topology and implementation details align strictly with the approaches discussed in the related scientific literature on SWAP fusion.