Fuse uncontrolled distributed CPU prefix-suffix multi-SWAPs by nathandelcid · Pull Request #786 · QuEST-Kit/QuEST

nathandelcid · 2026-06-08T15:53:43Z

Closes #595

Summary

Fuses QuEST's uncontrolled distributed CPU prefix/suffix multi-SWAP path so each moved amplitude chunk is routed directly to its final MPI rank, using the existing per-node cpuCommBuffer.

This is intentionally narrower than a public multi-SWAP API:

no public API changes
no user-visible numerical behavior changes expected
CPU distributed uncontrolled multi-SWAPs are fused
GPU quregs, controlled multi-SWAPs, single swaps, and non-distributed cases stay on the existing path

Context

Before this PR, anyCtrlMultiSwapBetweenPrefixAndSuffix() performed each prefix/suffix SWAP sequentially.

for (size_t i=0; i<targsA.size(); i++) {
    if (targsA[i] == targsB[i])
        continue;

    int suffixTarg = std::min(targsA[i], targsB[i]);
    int prefixTarg = std::max(targsA[i], targsB[i]);
    anyCtrlSwapBetweenPrefixAndSuffix(qureg, ctrls, ctrlStates, suffixTarg, prefixTarg);
}

That is inefficient because a multi-target localiser operation then performs one communication round per SWAP, repeatedly packing, exchanging, and unpacking amplitudes on intermediate nodes when they only need to be routed to their eventual node.

PR #785 already improved the volume by routing each subset directly, but the maintainer noted that its subset loop still performs up to 2^eta - 1 sequential exchange rounds, where eta is the number of prefix/suffix pairs being swapped:

for (qindex sub=1; sub<numSubsets; sub++) {
    // find pair rank
    // pack targeted amps
    // exchange sub-buffer
    // unpack amps
}

This PR addresses that follow-up concern by batching destination subsets into at most two MPI exchange waves. The two-wave limit comes from the existing buffer layout: only half of cpuCommBuffer can be used for sends because the other half must receive incoming chunks.

Implementation Details

The fused path is selected only when all constraints needed by this implementation are true: at least two active prefix/suffix pairs, no controls, distributed CPU state, and no GPU acceleration.

if (
    suffixTargs.size() >= 2 &&
    ctrls.empty() &&
    qureg.isDistributed &&
    !qureg.isGpuAccelerated
) {
    multiSwapBetweenPrefixAndSuffix(qureg, suffixTargs, prefixInds);
    return;
}

Everything outside that scope keeps the existing per-SWAP implementation.

for (size_t i=0; i<suffixTargs.size(); i++)
    anyCtrlSwapBetweenPrefixAndSuffix(
        qureg,
        ctrls,
        ctrlStates,
        suffixTargs[i],
        prefixInds[i] + qureg.logNumAmpsPerNode);

The fused routine converts the active prefix targets into rank-bit patterns. It skips the local no-op pattern and computes the rank that should receive each non-local suffix pattern.

qindex localPrefixPattern = getRankPatternInPrefixInds(qureg.rank, prefixInds);
qindex numPatterns = powerOf2(suffixTargs.size());

for (qindex pattern=0; pattern<numPatterns; pattern++) {
    if (pattern == localPrefixPattern)
        continue;

    remotePatterns.push_back(pattern);
    pairRanks.push_back(getRankWithPrefixIndsInPattern(qureg.rank, prefixInds, pattern));
}

Chunks are sized by the number of suffix patterns: each of the 2^eta patterns owns numAmpsPerNode / 2^eta amplitudes. A wave can fill at most the send half of cpuCommBuffer, which holds numPatterns / 2 chunks, so for eta >= 2 the 2^eta - 1 remote chunks need exactly two waves.

qindex chunkSize = qureg.numAmpsPerNode / numPatterns;
qindex maxChunksPerWave = numPatterns / 2;

for (qindex firstChunk=0; firstChunk<(qindex) remotePatterns.size(); firstChunk += maxChunksPerWave) {
    qindex numChunks = std::min(
        maxChunksPerWave,
        (qindex) remotePatterns.size() - firstChunk);

    // pack this wave, exchange all chunks, then unpack this wave
}
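To make the wave arithmetic concrete, the sketch below reuses the loop bounds from the excerpt above and returns how many chunks each wave carries for a given eta. The function name `chunksPerWaveSketch` and the `long long` stand-in for qindex are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using qindex = long long;  // stand-in for QuEST's qindex (assumption)

// Returns the number of chunks sent in each wave for eta swapped pairs,
// mirroring the loop bounds of the wave loop in the excerpt.
std::vector<qindex> chunksPerWaveSketch(int eta) {
    assert(eta >= 1 && eta < 62);
    qindex numPatterns = qindex(1) << eta;
    qindex numRemote = numPatterns - 1;         // every pattern except the local one
    qindex maxChunksPerWave = numPatterns / 2;  // chunks that fit in the send half

    std::vector<qindex> waves;
    for (qindex first = 0; first < numRemote; first += maxChunksPerWave)
        waves.push_back(std::min(maxChunksPerWave, numRemote - first));
    return waves;
}
```

For eta = 2 this yields waves of 2 and 1 chunks, and for eta = 3 waves of 4 and 3 chunks: the second wave is always one chunk short of full because the local pattern is never sent.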

Each destination suffix pattern is packed into an explicit send-buffer offset. The suffix qubits are sorted once and the pattern is converted to a bit mask, avoiding repeated sorting in the hot path.

qindex mask = getBitMaskOfQubitsInPattern(suffixTargs, pattern);
qindex bufferOffset = sendBase + c*chunkSize;

cpu_statevec_packAmpsIntoBufferAtOffset(
    qureg,
    sortedSuffixTargs,
    mask,
    bufferOffset);
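The packing routine itself is not reproduced here. One plausible way to implement it, sketched below under stated assumptions, is to enumerate chunk positions k and expand each into a full local index by inserting a zero bit at every sorted target position, then OR-ing in the pattern mask. Having the targets sorted in ascending order is what makes the bit insertion correct. All names ending in `Sketch`, and the `qindex`/`qcomp` stand-in types, are illustrative and not the PR's actual code.

```cpp
#include <cassert>
#include <complex>
#include <vector>

using qindex = long long;             // stand-in for QuEST's qindex (assumption)
using qcomp = std::complex<double>;   // stand-in for QuEST's qcomp (assumption)

// Bit i of pattern selects qubit qubits[i] (qubits need not be sorted here).
qindex bitMaskOfPatternSketch(const std::vector<int>& qubits, qindex pattern) {
    qindex mask = 0;
    for (size_t i = 0; i < qubits.size(); i++)
        if ((pattern >> i) & 1)
            mask |= qindex(1) << qubits[i];
    return mask;
}

// Insert a zero bit at each position in sortedQubits (ascending), then set the mask bits.
qindex expandIndexSketch(qindex k, const std::vector<int>& sortedQubits, qindex mask) {
    for (int q : sortedQubits) {
        qindex low = k & ((qindex(1) << q) - 1);
        qindex high = (k >> q) << (q + 1);
        k = high | low;
    }
    return k | mask;
}

// Copy every amplitude whose target bits equal mask into buffer[offset...], in index order.
void packSketch(const std::vector<qcomp>& amps, const std::vector<int>& sortedQubits,
                qindex mask, std::vector<qcomp>& buffer, qindex offset) {
    qindex chunkSize = (qindex) amps.size() >> sortedQubits.size();
    assert(offset + chunkSize <= (qindex) buffer.size());
    for (qindex k = 0; k < chunkSize; k++)
        buffer[offset + k] = amps[expandIndexSketch(k, sortedQubits, mask)];
}
```

With 16 local amplitudes, targets {2, 0} and pattern 1, the mask is 4 and the packed chunk holds local indices 4, 6, 12 and 14, written contiguously from the chosen buffer offset.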

comm_exchangeSubBufferChunks() exchanges equal-sized chunks using nonblocking MPI receives/sends and one MPI_Waitall per wave.

MPI_Irecv(
    &qureg.cpuCommBuffer[recvInd + messageOffset],
    messageSize,
    MPI_QCOMP,
    pairRanks[c],
    recvTag,
    mpiComm,
    &requests[reqInd++]);

MPI_Isend(
    &qureg.cpuCommBuffer[sendInd + messageOffset],
    messageSize,
    MPI_QCOMP,
    pairRanks[c],
    sendTag,
    mpiComm,
    &requests[reqInd++]);

MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE);

MPI tags are derived from the sender/receiver rank patterns, not from local chunk order, so paired ranks agree on message matching even when their chunk vectors are ordered differently.

int recvTag = static_cast<int>(recvTagBases[c]*numMessages + m);
int sendTag = static_cast<int>(sendTagBase*numMessages + m);
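How the tag bases are chosen is not shown in this description. A minimal sketch of one scheme consistent with the variable names, assuming sendTagBase is the sender's own prefix pattern and recvTagBases[c] is the prefix pattern of the rank being received from, is given below. The function `messageTagSketch` is hypothetical.

```cpp
#include <cassert>

using qindex = long long;  // stand-in for QuEST's qindex (assumption)

// Tag for the m-th message of a chunk, keyed on the sender's prefix pattern
// (an assumed scheme). Sender and receiver can both compute it without
// knowing where the chunk sits in the other rank's chunk vector.
int messageTagSketch(qindex senderPattern, qindex numMessages, qindex m) {
    assert(m >= 0 && m < numMessages);
    return static_cast<int>(senderPattern * numMessages + m);
}
```

Under this assumption a sender with pattern 2 tags message 1 of 3 as 7, and the receiver derives the same value from the sender's pattern alone. A real implementation would also need to keep tags within the range the MPI library supports, since the standard only guarantees an upper bound of at least 32767.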

Received chunks are unpacked from explicit receive-buffer offsets into their final local amplitude slots.

qindex mask = getBitMaskOfQubitsInPattern(suffixTargs, pattern);
qindex bufferOffset = recvBase + c*chunkSize;

cpu_statevec_unpackAmpsFromBufferAtOffset(
    qureg,
    sortedSuffixTargs,
    mask,
    bufferOffset);

Backend / API Impact

No public API changes.
No docs/examples changes needed because behavior is unchanged.
CPU distributed uncontrolled multi-SWAPs avoid sequential per-SWAP communication.
GPU and cuQuantum behavior is unchanged.
Controlled multi-SWAP behavior is unchanged.
Memory use remains within the existing cpuCommBuffer; extra metadata is limited to rank/pattern vectors.
Communication volume remains (1 - 1 / 2^eta) * numAmpsPerNode amplitudes per fused multi-SWAP per rank, with each moved amplitude crossing the network once.
Synchronization rounds are reduced from up to 2^eta - 1 subset exchanges to at most two MPI exchange waves.
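
The volume figure above follows from a chunk count. A short derivation, writing N for numAmpsPerNode and \eta for the number of swapped prefix/suffix pairs; the comparison with the sequential path assumes each individual prefix/suffix SWAP exchanges half of a node's amplitudes, which this description does not state explicitly:

```latex
V_{\text{fused}}
  = \underbrace{(2^{\eta} - 1)}_{\text{remote chunks}}
    \cdot \underbrace{\frac{N}{2^{\eta}}}_{\text{chunk size}}
  = \left(1 - \frac{1}{2^{\eta}}\right) N,
\qquad
V_{\text{sequential}} \approx \eta \cdot \frac{N}{2}.
```

For \eta = 3 this gives 7N/8 amplitudes per rank for the fused path, against 3N/2 for three separate SWAPs under that assumption.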

Validation

Configured and built an MPI/OpenMP test build:

cmake -S . -B /tmp/quest-mpi-build \
  -DQUEST_ENABLE_MPI=ON \
  -DQUEST_ENABLE_OMP=ON \
  -DQUEST_BUILD_TESTS=ON

cmake --build /tmp/quest-mpi-build -j2

CMake found MPICH 4.1 and built the test binary successfully.

Targeted MPI correctness tests passed:

np=1:
  applySwap          passed
  applyCompMatr2     passed
  applyCompMatr      passed
  calcPartialTrace   passed

np=2:
  applySwap          passed
  applyCompMatr2     passed
  applyCompMatr      passed
  calcPartialTrace   passed

np=4:
  applySwap          passed
  applyCompMatr2     passed
  applyCompMatr      passed with OMP_NUM_THREADS=1
  calcPartialTrace   passed

np=8:
  applySwap          passed with OMP_NUM_THREADS=1
  applyCompMatr2     passed with OMP_NUM_THREADS=1
  applyCompMatr      passed with OMP_NUM_THREADS=1
  calcPartialTrace   passed with OMP_NUM_THREADS=1

Also verified:

cmake --build build -j2
./build/min_example
git diff --check

Notes

The larger np=4 and np=8 local runs used OMP_NUM_THREADS=1 to avoid oversubscribing this machine with MPI ranks times OpenMP threads. Those runs still exercise the distributed CPU paths affected by this change.

I have not validated the GPU path because this PR deliberately leaves GPU quregs on the existing per-SWAP implementation.

TysonRayJones · 2026-06-09T02:54:25Z

This is a beautiful diff! (even if it contains no comments 😉) 🎉 Will review ASAP

nathandelcid added 2 commits June 6, 2026 13:37

feat: new lifecycle tutorial

2dae90e

chore: implement SWAP fusion

f69c3ff

nathandelcid changed the title from "Implement SWAP fusion" to "chore: implement SWAP fusion" on Jun 8, 2026

nathandelcid changed the base branch from main to devel June 8, 2026 16:25

delete redundant ex

174c413

nathandelcid changed the title from "chore: implement SWAP fusion" to "Fuse uncontrolled distributed CPU prefix-suffix multi-SWAPs" on Jun 9, 2026

Fix multi-swap fallback target handling

1a844e9

nathandelcid wants to merge 4 commits into QuEST-Kit:devel from nathandelcid:swap_fusion
