Fuse uncontrolled distributed CPU prefix-suffix multi-SWAPs #786

Open
nathandelcid wants to merge 4 commits into QuEST-Kit:devel from nathandelcid:swap_fusion

@nathandelcid nathandelcid commented Jun 8, 2026

Closes #595

Summary

Fuses QuEST's uncontrolled distributed CPU prefix/suffix multi-SWAP path so each moved amplitude chunk is routed directly to its final MPI rank, using the existing per-node cpuCommBuffer.

This is intentionally narrower than a public multi-SWAP API:

  • no public API changes
  • no user-visible numerical behavior changes expected
  • CPU distributed uncontrolled multi-SWAPs are fused
  • GPU quregs, controlled multi-SWAPs, single swaps, and non-distributed cases stay on the existing path

Context

Before this PR, anyCtrlMultiSwapBetweenPrefixAndSuffix() performed each prefix/suffix SWAP sequentially.

for (size_t i=0; i<targsA.size(); i++) {
    if (targsA[i] == targsB[i])
        continue;

    int suffixTarg = std::min(targsA[i], targsB[i]);
    int prefixTarg = std::max(targsA[i], targsB[i]);
    anyCtrlSwapBetweenPrefixAndSuffix(qureg, ctrls, ctrlStates, suffixTarg, prefixTarg);
}

That is inefficient: a multi-target localiser operation can perform several communication rounds, repeatedly packing, exchanging, and unpacking amplitudes that are merely in transit to their final node.

PR #785 already reduced the communication volume by routing each subset directly, but the maintainer noted that its subset loop still performs up to 2^eta - 1 sequential exchange rounds, where eta is the number of prefix/suffix SWAP pairs:

for (qindex sub=1; sub<numSubsets; sub++) {
    // find pair rank
    // pack targeted amps
    // exchange sub-buffer
    // unpack amps
}

This PR addresses that follow-up concern by batching destination subsets into at most two MPI exchange waves. The two-wave limit comes from the existing buffer layout: only half of cpuCommBuffer can be used for sends because the other half must receive incoming chunks.

Implementation Details

The fused path is selected only when all constraints needed by this implementation are true: at least two active prefix/suffix pairs, no controls, distributed CPU state, and no GPU acceleration.

if (
    suffixTargs.size() >= 2 &&
    ctrls.empty() &&
    qureg.isDistributed &&
    !qureg.isGpuAccelerated
) {
    multiSwapBetweenPrefixAndSuffix(qureg, suffixTargs, prefixInds);
    return;
}

Everything outside that scope keeps the existing per-SWAP implementation.

for (size_t i=0; i<suffixTargs.size(); i++)
    anyCtrlSwapBetweenPrefixAndSuffix(
        qureg,
        ctrls,
        ctrlStates,
        suffixTargs[i],
        prefixInds[i] + qureg.logNumAmpsPerNode);

The fused routine converts the active prefix targets into rank-bit patterns. It skips the local no-op pattern and computes the rank that should receive each non-local suffix pattern.

qindex localPrefixPattern = getRankPatternInPrefixInds(qureg.rank, prefixInds);
qindex numPatterns = powerOf2(suffixTargs.size());

for (qindex pattern=0; pattern<numPatterns; pattern++) {
    if (pattern == localPrefixPattern)
        continue;

    remotePatterns.push_back(pattern);
    pairRanks.push_back(getRankWithPrefixIndsInPattern(qureg.rank, prefixInds, pattern));
}
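
For readers unfamiliar with the rank-bit bookkeeping, the two helpers called above could plausibly look like the sketch below. This is an illustration only, not the code in this diff: the names `rankPatternInPrefixInds` and `rankWithPrefixIndsInPattern` are hypothetical stand-ins, `qindex` is assumed to be a 64-bit signed integer, and bit i of a pattern is assumed to correspond to `prefixInds[i]`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using qindex = std::int64_t;  // assumption: stand-in for QuEST's index type

// Hypothetical helper: gather the bits of `rank` found at positions
// prefixInds[i] into bit i of the returned pattern.
qindex rankPatternInPrefixInds(int rank, const std::vector<int>& prefixInds) {
    qindex pattern = 0;
    for (std::size_t i = 0; i < prefixInds.size(); i++) {
        qindex bit = (rank >> prefixInds[i]) & 1;
        pattern |= bit << i;
    }
    return pattern;
}

// Hypothetical helper: return `rank` with the bit at each position
// prefixInds[i] overwritten by bit i of `pattern`; other bits are untouched.
int rankWithPrefixIndsInPattern(int rank, const std::vector<int>& prefixInds, qindex pattern) {
    assert(pattern >= 0 && pattern < (qindex(1) << prefixInds.size()));
    for (std::size_t i = 0; i < prefixInds.size(); i++) {
        int bit = static_cast<int>((pattern >> i) & 1);
        rank = (rank & ~(1 << prefixInds[i])) | (bit << prefixInds[i]);
    }
    return rank;
}
```

Under these assumptions, rank 5 (binary 101) with `prefixInds = {0, 2}` has local pattern 3, and the remote patterns 0, 1, and 2 map to ranks 0, 1, and 4 respectively.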

Chunks are sized by the number of suffix patterns. Each wave can fill at most the send half of cpuCommBuffer, which holds 2^eta / 2 chunks, so the 2^eta - 1 remote chunks need no more than two waves for eta >= 2.

qindex chunkSize = qureg.numAmpsPerNode / numPatterns;
qindex maxChunksPerWave = numPatterns / 2;

for (qindex firstChunk=0; firstChunk<(qindex) remotePatterns.size(); firstChunk += maxChunksPerWave) {
    qindex numChunks = std::min(
        maxChunksPerWave,
        (qindex) remotePatterns.size() - firstChunk);

    // pack this wave, exchange all chunks, then unpack this wave
}
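
The two-wave bound can be checked with a little arithmetic. The sketch below is not part of the diff: `WavePlan` and `planWaves` are hypothetical names, and it assumes, as the chunk sizing above implies, that cpuCommBuffer holds numAmpsPerNode amplitudes so its send half holds numPatterns / 2 chunks.

```cpp
#include <cassert>
#include <cstdint>

using qindex = std::int64_t;  // assumption: stand-in for QuEST's index type

// Hypothetical summary of how the remote chunks are split into waves.
struct WavePlan {
    qindex chunkSize;         // amplitudes per destination chunk
    qindex maxChunksPerWave;  // chunks that fit in the send half of the buffer
    qindex numRemoteChunks;   // 2^eta - 1 (every pattern except the local one)
    qindex numWaves;          // ceil(numRemoteChunks / maxChunksPerWave)
};

WavePlan planWaves(qindex numAmpsPerNode, int eta) {
    assert(eta >= 2);  // the fused path requires at least two active pairs
    qindex numPatterns = qindex(1) << eta;
    assert(numAmpsPerNode % numPatterns == 0);

    WavePlan plan;
    plan.chunkSize = numAmpsPerNode / numPatterns;
    plan.maxChunksPerWave = numPatterns / 2;
    plan.numRemoteChunks = numPatterns - 1;
    plan.numWaves = (plan.numRemoteChunks + plan.maxChunksPerWave - 1) / plan.maxChunksPerWave;
    return plan;
}
```

For example, with 1024 amplitudes per node and eta = 3, this gives 128-amplitude chunks, 4 chunks in the first wave, and the remaining 3 in the second.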

Each destination suffix pattern is packed into an explicit send-buffer offset. The suffix qubits are sorted once and the pattern is converted to a bit mask, avoiding repeated sorting in the hot path.

qindex mask = getBitMaskOfQubitsInPattern(suffixTargs, pattern);
qindex bufferOffset = sendBase + c*chunkSize;

cpu_statevec_packAmpsIntoBufferAtOffset(
    qureg,
    sortedSuffixTargs,
    mask,
    bufferOffset);
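
To make the mask-based packing concrete, here is a rough sketch of how a pattern could be turned into a bit mask and how the matching local indices could be enumerated. It is an assumption-laden illustration rather than the code in this PR: `bitMaskOfQubitsInPattern`, `insertZeroBit`, and `localIndsMatchingMask` are hypothetical names, and bit i of the pattern is assumed to correspond to `qubits[i]`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using qindex = std::int64_t;  // assumption: stand-in for QuEST's index type

// Hypothetical helper: set bit qubits[i] of the mask when bit i of pattern is 1.
qindex bitMaskOfQubitsInPattern(const std::vector<int>& qubits, qindex pattern) {
    qindex mask = 0;
    for (std::size_t i = 0; i < qubits.size(); i++)
        if ((pattern >> i) & 1)
            mask |= qindex(1) << qubits[i];
    return mask;
}

// Insert a 0 bit at position pos, shifting the higher bits left by one.
qindex insertZeroBit(qindex value, int pos) {
    qindex low = value & ((qindex(1) << pos) - 1);
    qindex high = (value >> pos) << (pos + 1);
    return high | low;
}

// Enumerate the local amplitude indices whose target qubits match the mask.
// sortedQubits must be ascending so earlier insertions are not shifted later.
std::vector<qindex> localIndsMatchingMask(qindex numAmps, const std::vector<int>& sortedQubits, qindex mask) {
    qindex numOut = numAmps >> sortedQubits.size();
    std::vector<qindex> inds;
    for (qindex n = 0; n < numOut; n++) {
        qindex ind = n;
        for (int q : sortedQubits)
            ind = insertZeroBit(ind, q);
        inds.push_back(ind | mask);
    }
    return inds;
}
```

Because the mask is computed once per pattern and the qubits are sorted once up front, the inner loop does only shifts and ORs, which is the benefit the paragraph above describes.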

comm_exchangeSubBufferChunks() exchanges equal-sized chunks using nonblocking MPI receives/sends and one MPI_Waitall per wave.

MPI_Irecv(
    &qureg.cpuCommBuffer[recvInd + messageOffset],
    messageSize,
    MPI_QCOMP,
    pairRanks[c],
    recvTag,
    mpiComm,
    &requests[reqInd++]);

MPI_Isend(
    &qureg.cpuCommBuffer[sendInd + messageOffset],
    messageSize,
    MPI_QCOMP,
    pairRanks[c],
    sendTag,
    mpiComm,
    &requests[reqInd++]);

MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE);

MPI tags are derived from the sender/receiver rank patterns, not from local chunk order, so paired ranks agree on message matching even when their chunk vectors are ordered differently.

int recvTag = static_cast<int>(recvTagBases[c]*numMessages + m);
int sendTag = static_cast<int>(sendTagBase*numMessages + m);
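
The tag arithmetic can be illustrated in isolation. The sketch below assumes that the tag base is the sender's rank pattern, so the receiver derives the same base from the pattern of the rank it expects to hear from; the PR text does not spell this out, and `messageTag` is a hypothetical name. It also checks against 32767, the minimum value the MPI standard guarantees for MPI_TAG_UB, which is a conservative bound rather than something this diff is known to do.

```cpp
#include <cassert>
#include <cstdint>

using qindex = std::int64_t;  // assumption: stand-in for QuEST's index type

// Smallest upper bound on tags that every MPI implementation must support.
constexpr qindex kGuaranteedTagUpperBound = 32767;

// Hypothetical helper: tag for message m of the chunk sent by the rank whose
// pattern is senderPattern. Sender and receiver both call it with the
// sender's pattern, so they agree regardless of local chunk ordering.
int messageTag(qindex senderPattern, qindex numMessages, qindex m) {
    assert(senderPattern >= 0);
    assert(m >= 0 && m < numMessages);
    qindex tag = senderPattern * numMessages + m;
    assert(tag <= kGuaranteedTagUpperBound);
    return static_cast<int>(tag);
}
```

Since m < numMessages, distinct (pattern, m) pairs always yield distinct tags, so messages belonging to different chunks cannot be confused within a wave.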

Received chunks are unpacked from explicit receive-buffer offsets into their final local amplitude slots.

qindex mask = getBitMaskOfQubitsInPattern(suffixTargs, pattern);
qindex bufferOffset = recvBase + c*chunkSize;

cpu_statevec_unpackAmpsFromBufferAtOffset(
    qureg,
    sortedSuffixTargs,
    mask,
    bufferOffset);

Backend / API Impact

  • No public API changes.
  • No docs/examples changes needed because behavior is unchanged.
  • CPU distributed uncontrolled multi-SWAPs avoid sequential per-SWAP communication.
  • GPU and cuQuantum behavior is unchanged.
  • Controlled multi-SWAP behavior is unchanged.
  • Memory use remains within the existing cpuCommBuffer; extra metadata is limited to rank/pattern vectors.
  • Communication volume remains (1 - 1 / 2^eta) * numAmpsPerNode amplitudes per fused multi-SWAP per rank, with each moved amplitude crossing the network once.
  • Synchronization rounds are reduced from up to 2^eta - 1 subset exchanges to at most two MPI exchange waves.
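
As a quick sanity check on the volume figure, the sketch below computes the amplitudes sent per rank by the fused path and compares it with a sequential baseline. The baseline assumes each uncontrolled prefix/suffix SWAP sends half of a node's amplitudes; that is an assumption about the pre-existing path, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

using qindex = std::int64_t;  // assumption: stand-in for QuEST's index type

// Fused path: every chunk except the local one is sent exactly once,
// i.e. (1 - 1/2^eta) * numAmpsPerNode amplitudes.
qindex ampsSentFused(qindex numAmpsPerNode, int eta) {
    assert(eta >= 1);
    return numAmpsPerNode - (numAmpsPerNode >> eta);
}

// Assumed sequential baseline: eta separate SWAPs, each sending half the node.
qindex ampsSentSequential(qindex numAmpsPerNode, int eta) {
    assert(eta >= 1);
    return eta * (numAmpsPerNode / 2);
}
```

With 1024 amplitudes per node and eta = 3, the fused path sends 896 amplitudes per rank, whereas the assumed sequential baseline would send 1536.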

Validation

Configured and built an MPI/OpenMP test build:

cmake -S . -B /tmp/quest-mpi-build \
  -DQUEST_ENABLE_MPI=ON \
  -DQUEST_ENABLE_OMP=ON \
  -DQUEST_BUILD_TESTS=ON

cmake --build /tmp/quest-mpi-build -j2

CMake found MPICH 4.1 and built the test binary successfully.

Targeted MPI correctness tests passed:

np=1:
  applySwap          passed
  applyCompMatr2     passed
  applyCompMatr      passed
  calcPartialTrace   passed

np=2:
  applySwap          passed
  applyCompMatr2     passed
  applyCompMatr      passed
  calcPartialTrace   passed

np=4:
  applySwap          passed
  applyCompMatr2     passed
  applyCompMatr      passed with OMP_NUM_THREADS=1
  calcPartialTrace   passed

np=8:
  applySwap          passed with OMP_NUM_THREADS=1
  applyCompMatr2     passed with OMP_NUM_THREADS=1
  applyCompMatr      passed with OMP_NUM_THREADS=1
  calcPartialTrace   passed with OMP_NUM_THREADS=1

Also verified:

cmake --build build -j2
./build/min_example
git diff --check

Notes

The larger np=4 and np=8 local runs used OMP_NUM_THREADS=1 to avoid oversubscribing this machine with MPI ranks times OpenMP threads. Those runs still exercise the distributed CPU paths affected by this change.

I have not validated the GPU path because this PR deliberately leaves GPU quregs on the existing per-SWAP implementation.

@nathandelcid nathandelcid changed the title Implement SWAP fusion chore: implement SWAP fusion Jun 8, 2026
@nathandelcid nathandelcid changed the base branch from main to devel June 8, 2026 16:25
@TysonRayJones
Member

This is a beautiful diff! (even if it contains no comments 😉) 🎉 Will review ASAP

@nathandelcid nathandelcid changed the title chore: implement SWAP fusion Fuse uncontrolled distributed CPU prefix-suffix multi-SWAPs Jun 9, 2026
Development

Successfully merging this pull request may close these issues.

Implement SWAP fusion