Update code to Optimise-away small GPU allocations for projector & mu… unitaryHACK26#783
Conversation
|
This is a wonderful diff - I'm kicking myself for not noticing Can you please share the mentioned driver/scripts for benchmarking? Can either whack it into a comment here, or include it into the diff (which we can delete later - changes will be squashed so it won't pollute your work). |
|
Note to self The template parameters of the below functions and functors are now redundant:
They can all be removed, along with the parameter dispatch in |
…review)
After the bitmask reformulation the projector no longer specialises on the target
count, so its numTargs template is dead code:
- drop the template from functor_projectStateVec/functor_projectDensMatr,
thrust_{statevec,densmatr}_multiQubitProjector_sub and
gpu_{statevec,densmatr}_multiQubitProjector_sub, and their
INSTANTIATE_FUNC_OPTIMISED_FOR_NUM_TARGS instantiations;
- apply the same bitmask reformulation to the CPU projector
(cpu_{statevec,densmatr}_multiQubitProjector_sub) so its template goes too;
- simplify accel_{statevec,densmatr}_multiQubitProjector_sub to a plain
isGpuAccelerated ? gpu_ : cpu_ branch (no GET_CPU_OR_GPU_FUNC dispatch).
Shared dispatch macros and the calcProb* template chain are untouched. Unit tests
pass on CPU/CPU+OMP/GPU/GPU+OMP (57146 assertions). Adds a throw-away benchmarks/
driver (not wired into CMake/CI); safe to squash/drop on merge.
Thanks @TysonRayJones! Glad it's useful. 🎉 I've added the driver to the PR under cmake -S . -B build_bench \
-D QUEST_ENABLE_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES=120 \
-D CMAKE_BUILD_TYPE=Release \
-D USER_SOURCE_NAMES=benchmarks/benchmark_749.cpp \
-D USER_OUTPUT_EXE_NAME=bench_749
cmake --build build_bench --target bench_749 -j
./build_bench/bench_749 4 20 3 2000 # [minQ maxQ numTargs reps]It forces the single-GPU path ( On my machine (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120): CUDA runtime API counts (N = 8…12, 200 reps, both ops, via
Per-call wall time (µs, numTargs = 3, 2000 reps):
(The residual ~1000 |
Done — I went ahead and removed them (pushed in a follow-up commit). Summary:
One thing to confirm: your note listed the GPU-side functions, but the A couple of notes for the record:
Verification (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120, Release): rebuilt Also confirming: I'm fine with the |
Profile and optimise-away small GPU allocations
Closes #749
Summary
The single-GPU backend copied the qubit-index list from host to device
(
cudaMalloc+cudaMemcpyAsync+cudaFree) on every call of severalmulti-qubit operations, via the
getDevInts()helper. For small Quregs thisfixed allocation latency dominates the actual kernel runtime — exactly the
overhead issue #749 asks us to profile and remove.
This PR eliminates that per-call copy for the two operations agreed in scope,
following the register-resident-bitmask pattern already used by
thrust_statevec_calcExpecAnyTargZ_sub:applyMultiQubitProjectorthrust_{statevec,densmatr}_multiQubitProjector_subcalcProbOfMultiQubitOutcomethrust_{statevec,densmatr}_calcProbOfMultiQubitOutcome_subAll changes are confined to a single file:
quest/src/gpu/gpu_thrust.cuh(52 insertions, 52 deletions). No public API, dispatch, or
thrust_*signatureschange — only the internals.
Design
Projectors — bitmask reformulation (removes the copy for all sizes)
A projector keeps an amplitude iff its target qubits match the requested
outcomes. The original device-array test
getValueOfBits(n, targetsPtr, numBits) == retainValueis mathematically identical to the register-only primitive
where
qubitMask = util_getBitMask(qubits)flags the target positions andvalueMask = util_getBitMask(qubits, outcomes)holds the desired outcome bits.Both masks are plain
qindexscalars passed as kernel arguments, so no devicearray is allocated at all. For the density-matrix functor the same masks are
applied to both the row and column substates:
Outcome probability — pass the list by value (removes the copy for all sizes)
calcProbOfMultiQubitOutcomeusesfunctor_insertBits, whoseinsertBitsWithMaskedValuesis a scatter that needs the actual sorted qubitpositions (a bitmask is insufficient). Instead of allocating a
device_vector, the functor now stores the positions in aList64by value— a trivially-copyable, fixed-size (
int[64]), CUDA-kernel-compatible structthat already exists in the codebase precisely for this purpose
(
quest/src/core/lists.hpp). The list rides along as a kernel argument, socudaMalloc/cudaMemcpydisappear for every qubit count (the previous code onlyavoided the copy never — it always called
getDevInts).getDevInts()itself is untouched and still used by ~20 other operations ingpu_subroutines.cppthat are out of scope for this issue.Results (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120)
Nsight Systems — CUDA API call counts
Trace of N = 8…12, 200 reps each, both operations, captured with
Nsight Systems 2025.3.2 (
nsys profile+ per-call CUDA runtime API counts):cudaMalloccudaFreecudaMemcpyAsynccudaLaunchKernelThe ~2000 eliminated allocations are exactly the two per-call qubit-list copies
(one in the projector, one in the probability path). The residual ~1000
cudaMallocin the optimised build isthrust::reduce's own internal temporaryin
calcProb— inherent to Thrust and out of scope.cudaLaunchKernelisunchanged, confirming the kernels themselves are untouched.
Wall-clock per call (microseconds),
numTargs = 3, 2000 repsIn the small-Qureg regime the projector is consistently ~1.8× faster and the
probability calc ~1.4× faster, purely from removing the allocation. (The
large baseline jump at N≥17 is the allocation interacting with the now-larger
state kernels; removing it also removes that cliff.)
Correctness
Built with
-D QUEST_ENABLE_CUDA=ON -D QUEST_BUILD_TESTS=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Releaseagainst CUDA 13.0.The unit tests for the affected operations pass across all four deployments
(CPU, CPU+OpenMP, GPU, GPU+OpenMP):
(A correct sm_120 build is also implicitly validated, since a wrong architecture
silently corrupts GPU results.)
Re-verified end-to-end on a freshly re-cloned
devel(HEADb9830592) with thesingle-file change re-applied: configure, build, and the test command above all
pass unchanged.
How to reproduce the measurements
The numbers above come from a small standalone driver (kept out of this PR to
preserve the single-file diff) that, against both a clean
origin/develbuild andthis branch:
-D QUEST_ENABLE_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Release;N, allocates a Qureg, then times a loop ofapplyMultiQubitProjectorand
calcProbOfMultiQubitOutcome(numTargs = 3) with a CUDA-synchronisedwall clock over many reps (per-call µs in the table);
nsys profileand tallies CUDA runtime API calls(
cudaMalloc/cudaFree/cudaMemcpyAsync/cudaLaunchKernel) for the table above.Happy to share the driver/scripts separately if useful for CI; they are not part of
this change.
Notes for reviewers
devel(the active unitaryHACK branch and the only one thatbuilds on CUDA 13 —
main/v4.2 still usesthrust::binary_function, removedin CUDA 13's libcu++).
cuQuantum is not enabled). cuStateVec is a separate backend, not what Profile and optimise-away small GPU allocations #749
concerns; it was left disabled for these measurements. (For the record, as of
2026 cuStateVec does support CUDA 13 and Blackwell, so this is a scope choice,
not a compatibility workaround.)
getDevIntsingpu_thrust.cuh.AI usage disclosure
Per unitaryHACK's AI guide ("human-in-the-loop";
honesty required for bounty eligibility): an AI coding assistant (Anthropic Claude,
via Claude Code) was used as a co-pilot for parts of this work — to help survey the
gpu_thrust.cuhcode paths, brainstorm the bitmask/List64-by-value reformulation,and draft this PR description and the profiling methodology. It
was not the author of record: every change was reviewed, compiled, and tested by
me on real hardware (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120). The diff is a single
file (+52/−52), the algebraic equivalence of the bitmask reformulation was checked by
hand, and correctness was confirmed by the upstream unit tests passing across all four
deployments (CPU / CPU+OpenMP / GPU / GPU+OpenMP). No unverified or copy-pasted AI
output is included.
unitaryHACK 2026 checklist
Closes #749).