Fuse uncontrolled distributed CPU prefix-suffix multi-SWAPs #786
Open
nathandelcid wants to merge 4 commits into
Member
This is a beautiful diff! (even if it contains no comments 😉) 🎉 Will review ASAP
Closes #595
Summary
Fuses QuEST's uncontrolled distributed CPU prefix/suffix multi-SWAP path so each moved amplitude chunk is routed directly to its final MPI rank, using the existing per-node `cpuCommBuffer`.
This is intentionally narrower than a public multi-SWAP API: GPU `quregs`, controlled multi-SWAPs, single swaps, and non-distributed cases stay on the existing path.
Context
Before this PR, `anyCtrlMultiSwapBetweenPrefixAndSuffix()` performed each prefix/suffix SWAP sequentially.
That is inefficient because a multi-target localiser operation can perform several communication rounds and repeatedly pack/exchange/unpack amplitudes that are only being routed to their eventual node.
PR #785 already reduced the communication volume by routing each subset directly, but the maintainer noted that its subset loop still performs up to `2^eta - 1` sequential exchange rounds.
This PR addresses that follow-up concern by batching destination subsets into at most two MPI exchange waves. The two-wave limit comes from the existing buffer layout: only half of `cpuCommBuffer` can be used for sends because the other half must receive incoming chunks.
Implementation Details
The fused path is selected only when all constraints needed by this implementation are true: at least two active prefix/suffix pairs, no controls, distributed CPU state, and no GPU acceleration.
Everything outside that scope keeps the existing per-SWAP implementation.
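The selection rule above can be read as a single predicate. The following is a minimal sketch only: `SwapCallInfo`, its fields, and `canUseFusedMultiSwap()` are hypothetical names introduced for illustration and are not taken from the QuEST source.

```cpp
#include <cstddef>

// Hypothetical summary of the properties the dispatch decision depends on.
// These field names are illustrative; they are not QuEST's own.
struct SwapCallInfo {
    std::size_t numActivePairs;  // prefix/suffix pairs that actually need swapping
    std::size_t numCtrls;        // control qubits attached to the operation
    bool isDistributed;          // state is partitioned across MPI ranks
    bool isGpuAccelerated;       // amplitudes live in GPU memory
};

// True only when every precondition of the fused path holds;
// any other call falls back to the per-SWAP implementation.
bool canUseFusedMultiSwap(const SwapCallInfo& info) {
    return info.numActivePairs >= 2
        && info.numCtrls == 0
        && info.isDistributed
        && !info.isGpuAccelerated;
}
```

Keeping the check in one place makes the fallback explicit: a single failed condition, such as one control qubit, is enough to route the call to the existing path.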
The fused routine converts the active prefix targets into rank-bit patterns. It skips the local no-op pattern and computes the rank that should receive each non-local suffix pattern.
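To make the pattern-to-rank step concrete, here is a small sketch of how a destination rank could be derived. It assumes that the i-th prefix target maps to bit `rankBits[i]` of the rank index and that bit i of a suffix pattern is the value held by the i-th suffix target. The function names are hypothetical and the PR's actual helpers may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Rank that should receive the chunk whose suffix targets hold `pattern`:
// overwrite each swapped rank bit with the matching pattern bit.
int destinationRank(int myRank, const std::vector<int>& rankBits, std::uint32_t pattern) {
    int dest = myRank;
    for (std::size_t i = 0; i < rankBits.size(); ++i) {
        int bit = static_cast<int>((pattern >> i) & 1u);
        dest = (dest & ~(1 << rankBits[i])) | (bit << rankBits[i]);
    }
    return dest;
}

// The pattern already encoded in this rank's own prefix bits.
// A chunk with this suffix pattern stays where it is (the local no-op).
std::uint32_t localPattern(int myRank, const std::vector<int>& rankBits) {
    std::uint32_t pattern = 0;
    for (std::size_t i = 0; i < rankBits.size(); ++i)
        pattern |= static_cast<std::uint32_t>((myRank >> rankBits[i]) & 1) << i;
    return pattern;
}

// Every (suffix pattern, destination rank) pair that needs communication.
std::vector<std::pair<std::uint32_t, int>> remoteDestinations(
        int myRank, const std::vector<int>& rankBits) {
    std::vector<std::pair<std::uint32_t, int>> out;
    std::uint32_t numPatterns = 1u << rankBits.size();
    std::uint32_t skip = localPattern(myRank, rankBits);
    for (std::uint32_t p = 0; p < numPatterns; ++p)
        if (p != skip)
            out.emplace_back(p, destinationRank(myRank, rankBits, p));
    return out;
}
```

With two swapped rank bits `{0, 1}` on rank 1, the local pattern is 1, so patterns 0, 2 and 3 are sent to ranks 0, 2 and 3 respectively.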
Chunks are sized by the number of suffix patterns. The first wave can fill at most the send half of `cpuCommBuffer`, so `eta >= 2` needs no more than two waves for the `2^eta - 1` remote chunks.
Each destination suffix pattern is packed into an explicit send-buffer offset. The suffix qubits are sorted once and the pattern is converted to a bit mask, avoiding repeated sorting in the hot path.

    qindex mask = getBitMaskOfQubitsInPattern(suffixTargs, pattern);
    qindex bufferOffset = sendBase + c*chunkSize;
    cpu_statevec_packAmpsIntoBufferAtOffset(
        qureg, sortedSuffixTargs, mask, bufferOffset);

`comm_exchangeSubBufferChunks()` exchanges equal-sized chunks using nonblocking MPI receives/sends and one `MPI_Waitall` per wave.
MPI tags are derived from the sender/receiver rank patterns, not from local chunk order, so paired ranks agree on message matching even when their chunk vectors are ordered differently.
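Two of the claims above can be checked with a few lines of arithmetic: the two-wave bound and the order-independence of the tags. The sketch below assumes that `cpuCommBuffer` holds as many amplitudes as one node's partition, so its send half holds `numAmpsPerNode / 2`. The names `WavePlan`, `planWaves()` and `exchangeTag()` are hypothetical, and the tag formula is one possible scheme, not necessarily the one used in this PR.

```cpp
#include <cstdint>

// Sizes that follow from eta swapped pairs on a node holding numAmpsPerNode amplitudes.
struct WavePlan {
    std::uint64_t chunkSize;        // amplitudes per destination chunk
    std::uint64_t chunksPerWave;    // chunks that fit in the send half of the buffer
    std::uint64_t numRemoteChunks;  // 2^eta - 1 (the local pattern is skipped)
    std::uint64_t numWaves;         // exchange waves needed to send them all
};

// Requires eta >= 1 and numAmpsPerNode >= 2^eta.
// Assumption: the send half of the buffer holds numAmpsPerNode / 2 amplitudes.
WavePlan planWaves(std::uint64_t numAmpsPerNode, unsigned eta) {
    std::uint64_t numPatterns = std::uint64_t{1} << eta;
    WavePlan plan;
    plan.chunkSize = numAmpsPerNode / numPatterns;
    plan.chunksPerWave = (numAmpsPerNode / 2) / plan.chunkSize;  // equals 2^(eta-1)
    plan.numRemoteChunks = numPatterns - 1;
    // Ceiling division: waves needed to send every remote chunk.
    plan.numWaves = (plan.numRemoteChunks + plan.chunksPerWave - 1) / plan.chunksPerWave;
    return plan;
}

// Hypothetical tag scheme: built only from the sender's and receiver's rank
// patterns, so both ends compute the same value regardless of the order in
// which each rank lists its chunks. The MPI standard guarantees tags up to
// at least 32767, which this fits for eta <= 7.
int exchangeTag(std::uint32_t senderPattern, std::uint32_t receiverPattern, unsigned eta) {
    return static_cast<int>((senderPattern << eta) | receiverPattern);
}
```

For example, `planWaves(1024, 3)` gives chunks of 128 amplitudes, 4 chunks per wave and 7 remote chunks, hence 2 waves. With `eta = 1` there is a single remote chunk and one wave suffices.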
Received chunks are unpacked from explicit receive-buffer offsets into their final local amplitude slots.

    qindex mask = getBitMaskOfQubitsInPattern(suffixTargs, pattern);
    qindex bufferOffset = recvBase + c*chunkSize;
    cpu_statevec_unpackAmpsFromBufferAtOffset(
        qureg, sortedSuffixTargs, mask, bufferOffset);

Backend / API Impact
- `cpuCommBuffer`; extra metadata is limited to rank/pattern vectors.
- `(1 - 1 / 2^eta) * numAmpsPerNode` amplitudes per fused multi-SWAP per rank, with each moved amplitude crossing the network once.
- `2^eta - 1` subset exchanges to at most two MPI exchange waves.
Validation
Configured and built an MPI/OpenMP test build:

    cmake -S . -B /tmp/quest-mpi-build \
      -DQUEST_ENABLE_MPI=ON \
      -DQUEST_ENABLE_OMP=ON \
      -DQUEST_BUILD_TESTS=ON
    cmake --build /tmp/quest-mpi-build -j2

CMake found MPICH 4.1 and built the test binary successfully.
Targeted MPI correctness tests passed:
Also verified:
Notes
The larger `np=4` and `np=8` local runs used `OMP_NUM_THREADS=1` to avoid oversubscribing this machine with MPI ranks times OpenMP threads. Those runs still exercise the distributed CPU paths affected by this change.
I have not validated the GPU path because this PR deliberately leaves GPU `quregs` on the existing per-SWAP implementation.