Update code to Optimise-away small GPU allocations for projector & mu… unitaryHACK26 by thedaemon-wizard · Pull Request #783 · QuEST-Kit/QuEST

thedaemon-wizard · 2026-06-07T14:35:33Z

Profile and optimise-away small GPU allocations

Closes #749

Summary

The single-GPU backend copied the qubit-index list from host to device
(cudaMalloc + cudaMemcpyAsync + cudaFree) on every call of several
multi-qubit operations, via the getDevInts() helper. For small Quregs this
fixed allocation latency dominates the actual kernel runtime — exactly the
overhead issue #749 asks us to profile and remove.

This PR eliminates that per-call copy for the two operations agreed in scope,
following the register-resident-bitmask pattern already used by
thrust_statevec_calcExpecAnyTargZ_sub:

| Operation | Public API | Subroutine | Technique |
| --- | --- | --- | --- |
| Multi-qubit projector | `applyMultiQubitProjector` | `thrust_{statevec,densmatr}_multiQubitProjector_sub` | reformulate the per-qubit test as two primitive bitmasks (no list at all) |
| Multi-qubit outcome probability | `calcProbOfMultiQubitOutcome` | `thrust_{statevec,densmatr}_calcProbOfMultiQubitOutcome_sub` | carry the tiny sorted list by value inside the Thrust functor |

All changes are confined to a single file: quest/src/gpu/gpu_thrust.cuh
(52 insertions, 52 deletions). No public API, dispatch, or thrust_* signatures
change — only the internals.

Design

Projectors — bitmask reformulation (removes the copy for all sizes)

A projector keeps an amplitude iff its target qubits match the requested
outcomes. The original device-array test

getValueOfBits(n, targetsPtr, numBits) == retainValue

is mathematically identical to the register-only primitive

(n & qubitMask) == valueMask

where qubitMask = util_getBitMask(qubits) flags the target positions and
valueMask = util_getBitMask(qubits, outcomes) holds the desired outcome bits.
Both masks are plain qindex scalars passed as kernel arguments, so no device
array is allocated at all. For the density-matrix functor the same masks are
applied to both the row and column substates:

qreal fac = renorm * ((r & qubitMask) == valueMask) * ((c & qubitMask) == valueMask);
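
The equivalence can be checked exhaustively on the host for a small register. The following is a minimal sketch, not the QuEST implementation: `gatherBits`, `makeQubitMask`, `makeValueMask` and `testsAgree` are hypothetical stand-ins for `getValueOfBits` and the two `util_getBitMask` overloads, and it assumes `retainValue` packs `outcomes[j]` into bit `j`.

```cpp
#include <cstddef>
#include <vector>

using qindex = long long;  // assumption: stands in for QuEST's index type

// Reference form: gather the bits of n found at `qubits` into a packed value,
// where bit j of the result is bit qubits[j] of n.
qindex gatherBits(qindex n, const std::vector<int>& qubits) {
    qindex value = 0;
    for (std::size_t j = 0; j < qubits.size(); ++j)
        value |= ((n >> qubits[j]) & 1LL) << j;
    return value;
}

// Mask with a 1 at every target position.
qindex makeQubitMask(const std::vector<int>& qubits) {
    qindex mask = 0;
    for (int q : qubits)
        mask |= 1LL << q;
    return mask;
}

// Mask holding outcomes[j] at position qubits[j].
qindex makeValueMask(const std::vector<int>& qubits, const std::vector<int>& outcomes) {
    qindex mask = 0;
    for (std::size_t j = 0; j < qubits.size(); ++j)
        mask |= static_cast<qindex>(outcomes[j]) << qubits[j];
    return mask;
}

// Compare the gather-based test and the mask-based test on every basis index.
bool testsAgree(int numQubits, const std::vector<int>& qubits, const std::vector<int>& outcomes) {
    // Packed outcome value, in the same bit order that gatherBits produces.
    qindex retainValue = 0;
    for (std::size_t j = 0; j < outcomes.size(); ++j)
        retainValue |= static_cast<qindex>(outcomes[j]) << j;

    const qindex qubitMask = makeQubitMask(qubits);
    const qindex valueMask = makeValueMask(qubits, outcomes);

    for (qindex n = 0; n < (1LL << numQubits); ++n) {
        bool gathered = gatherBits(n, qubits) == retainValue;
        bool masked = (n & qubitMask) == valueMask;
        if (gathered != masked)
            return false;
    }
    return true;
}
```

For targets `{1, 3, 4}` with outcomes `{1, 0, 1}` the two masks are 26 and 18, and `testsAgree(6, {1, 3, 4}, {1, 0, 1})` compares both tests on all 64 basis indices. The check relies only on the targets being distinct, not on their order.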

Outcome probability — pass the list by value (removes the copy for all sizes)

calcProbOfMultiQubitOutcome uses functor_insertBits, whose
insertBitsWithMaskedValues is a scatter that needs the actual sorted qubit
positions (a bitmask is insufficient). Instead of allocating a
device_vector, the functor now stores the positions in a List64 by value
— a trivially-copyable, fixed-size (int[64]), CUDA-kernel-compatible struct
that already exists in the codebase precisely for this purpose
(quest/src/core/lists.hpp). The list rides along as a kernel argument, so
cudaMalloc/cudaMemcpy disappear for every qubit count (the previous code never
avoided the copy; it always called getDevInts).
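
A host-only sketch of the by-value idea, under stated assumptions: `SmallList` is a hypothetical stand-in for `List64`, `InsertBitsFunctor` is an illustrative functor rather than QuEST's `functor_insertBits`, and the plain loop in `sumOutcomeProb` stands in for the Thrust reduction. Because the functor owns a trivially copyable list, copying the functor copies the positions with it, so nothing needs a separate device allocation.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

using qindex = long long;  // assumption: stands in for QuEST's index type

// Hypothetical fixed-capacity list; plain data, so it can travel inside a functor.
struct SmallList {
    int items[64];
    int count;
};
static_assert(std::is_trivially_copyable<SmallList>::value,
              "the list must be copyable as a plain kernel argument");

// Open a zero bit at position pos, shifting the higher bits of n up by one.
inline qindex insertZeroBit(qindex n, int pos) {
    qindex lower = n & ((1LL << pos) - 1);
    qindex upper = (n >> pos) << (pos + 1);
    return upper | lower;
}

// Maps a reduced index (targets removed) to the full index with the outcome bits set.
struct InsertBitsFunctor {
    SmallList sortedQubits;  // ascending target positions, held by value
    qindex valueMask;        // outcome bits placed at the target positions

    qindex operator()(qindex i) const {
        // Ascending order matters: each insertion shifts the bits above it.
        for (int j = 0; j < sortedQubits.count; ++j)
            i = insertZeroBit(i, sortedQubits.items[j]);
        return i | valueMask;
    }
};

// Sum the probabilities of every basis state consistent with the outcome.
double sumOutcomeProb(const std::vector<double>& probs, int numQubits,
                      InsertBitsFunctor toFullIndex) {
    qindex numTerms = 1LL << (numQubits - toFullIndex.sortedQubits.count);
    double total = 0;
    for (qindex i = 0; i < numTerms; ++i)
        total += probs[static_cast<std::size_t>(toFullIndex(i))];
    return total;
}
```

With targets `{1, 3}` and outcomes `{1, 0}` (so `valueMask = 2`), the reduced index 3 maps to the full index 7: bits 0 and 2 carry the reduced index, bit 1 is the outcome 1, and bit 3 is the outcome 0.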

getDevInts() itself is untouched and still used by ~20 other operations in
gpu_subroutines.cpp that are out of scope for this issue.

Results (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120)

Nsight Systems — CUDA API call counts

Trace of N = 8…12, 200 reps each, both operations, captured with
Nsight Systems 2025.3.2 (nsys profile + per-call CUDA runtime API counts):

| CUDA runtime call | baseline | optimised | change |
| --- | ---: | ---: | --- |
| `cudaMalloc` | 3020 | 1010 | −67% |
| `cudaFree` | 3021 | 1011 | −67% |
| `cudaMemcpyAsync` | 3015 | 1005 | −67% |
| `cudaLaunchKernel` | 2025 | 2025 | unchanged (identical compute) |

The ~2000 eliminated allocations are exactly the two per-call qubit-list copies
(one in the projector, one in the probability path). The residual ~1000
cudaMalloc in the optimised build is thrust::reduce's own internal temporary
in calcProb — inherent to Thrust and out of scope. cudaLaunchKernel is
unchanged, confirming the kernels themselves are untouched.
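
The "~2000" figure follows from the trace parameters, and the arithmetic can be written down as a compile-time check. This is only a bookkeeping sketch; the function names are illustrative and the inputs are the numbers quoted in the table.

```cpp
// Expected per-call list copies in the traced run: one copy per operation call.
constexpr long expectedListCopies(int numSizes, int repsPerSize, int numOps) {
    return static_cast<long>(numSizes) * repsPerSize * numOps;
}

// Calls removed between the baseline and optimised builds.
constexpr long eliminated(long baseline, long optimised) {
    return baseline - optimised;
}

// N = 8..12 is five register sizes, 200 reps each, two operations.
static_assert(expectedListCopies(5, 200, 2) == 2000, "expected list copies");

// Observed differences for cudaMalloc, cudaFree and cudaMemcpyAsync.
static_assert(eliminated(3020, 1010) == 2010, "cudaMalloc");
static_assert(eliminated(3021, 1011) == 2010, "cudaFree");
static_assert(eliminated(3015, 1005) == 2010, "cudaMemcpyAsync");
```

Each observed difference is 2010, which is 10 calls above the 2000 estimate; the description does not break down those extra calls.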

Wall-clock per call (microseconds), `numTargs = 3`, 2000 reps

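| N | projector base (µs) | projector opt (µs) | projector speedup | prob base (µs) | prob opt (µs) | prob speedup |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 4 | 12.41 | 6.48 | 1.92× | 20.35 | 14.59 | 1.39× |
| 8 | 11.74 | 6.40 | 1.83× | 19.44 | 14.18 | 1.37× |
| 12 | 11.85 | 6.52 | 1.82× | 19.79 | 14.44 | 1.37× |
| 16 | 12.22 | 6.84 | 1.79× | 28.63 | 23.21 | 1.23× |
| 20 | 58.41 | 13.74 | 4.25× | 67.99 | 61.99 | 1.10× |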

In the small-Qureg regime the projector is consistently ~1.8× faster and the
probability calc ~1.4× faster, purely from removing the allocation. (The
large baseline jump at N≥17 is the allocation interacting with the now-larger
state kernels; removing it also removes that cliff.)

Correctness

Built with -D QUEST_ENABLE_CUDA=ON -D QUEST_BUILD_TESTS=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Release against CUDA 13.0.
The unit tests for the affected operations pass across all four deployments
(CPU, CPU+OpenMP, GPU, GPU+OpenMP):

./build/tests/tests "*QubitProjector*,*calcProbOfMultiQubitOutcome*,*calcProbOfQubitOutcome*"
# All tests passed (57146 assertions in 8 test cases)

(A correct sm_120 build is also implicitly validated, since a wrong architecture
silently corrupts GPU results.)

Re-verified end-to-end on a freshly re-cloned devel (HEAD b9830592) with the
single-file change re-applied: configure, build, and the test command above all
pass unchanged.

How to reproduce the measurements

The numbers above come from a small standalone driver (kept out of this PR to
preserve the single-file diff) that, against both a clean origin/devel build and
this branch:

* builds QuEST with -D QUEST_ENABLE_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Release;
* for each N, allocates a Qureg, then times a loop of applyMultiQubitProjector
  and calcProbOfMultiQubitOutcome (numTargs = 3) with a CUDA-synchronised
  wall clock over many reps (per-call µs in the table);
* wraps a shorter run under nsys profile and tallies CUDA runtime API calls
  (cudaMalloc/cudaFree/cudaMemcpyAsync/cudaLaunchKernel) for the API-count table above.
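
The timing step can be sketched as a small generic helper. This is an assumption about the shape of such a harness, not the driver itself: `meanMicrosPerCall` is a hypothetical name, and the `op` and `sync` callables are placeholders for the QuEST call under test and for a device synchronisation.

```cpp
#include <chrono>
#include <cstddef>

// Mean wall-clock time per call, in microseconds, over `reps` calls of `op`.
// `sync` must block until all previously queued device work has finished,
// otherwise the clock only measures how long the launches took to enqueue.
template <typename Op, typename Sync>
double meanMicrosPerCall(Op&& op, Sync&& sync, std::size_t reps) {
    if (reps == 0)
        return 0.0;

    using clock = std::chrono::steady_clock;

    sync();  // drain earlier work so it is not charged to this region
    const auto start = clock::now();
    for (std::size_t i = 0; i < reps; ++i)
        op();
    sync();  // wait for every call in the loop to complete
    const auto stop = clock::now();

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(reps);
}
```

In a QuEST driver, `op` would wrap a call such as applyMultiQubitProjector on a prepared Qureg and `sync` would wrap syncQuESTEnv(). Synchronising once before and once after the loop, rather than per call, keeps the synchronisation cost out of the per-call figure.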

Happy to share the driver/scripts separately if useful for CI; they are not part of
this change.

Notes for reviewers

* Base branch is devel (the active unitaryHACK branch and the only one that
  builds on CUDA 13; main/v4.2 still uses thrust::binary_function, which was
  removed in the CCCL release bundled with CUDA 13).
* This optimises QuEST's native Thrust GPU backend (the path taken when
  cuQuantum is not enabled). cuStateVec is a separate backend and not what #749
  concerns; it was left disabled for these measurements. (For the record, as of
  2026 cuStateVec does support CUDA 13 and Blackwell, so this is a scope choice,
  not a compatibility workaround.)
* This aligns with the in-flight "James' GPU refactor" placeholder note on
  getDevInts in gpu_thrust.cuh.

AI usage disclosure

Per unitaryHACK's AI guide ("human-in-the-loop";
honesty required for bounty eligibility): an AI coding assistant (Anthropic Claude,
via Claude Code) was used as a co-pilot for parts of this work — to help survey the
gpu_thrust.cuh code paths, brainstorm the bitmask/List64-by-value reformulation,
and draft this PR description and the profiling methodology. It
was not the author of record: every change was reviewed, compiled, and tested by
me on real hardware (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120). The diff is a single
file (+52/−52), the algebraic equivalence of the bitmask reformulation was checked by
hand, and correctness was confirmed by the upstream unit tests passing across all four
deployments (CPU / CPU+OpenMP / GPU / GPU+OpenMP). No unverified or copy-pasted AI
output is included.

unitaryHACK 2026 checklist

* PR description links the issue (Closes #749).
* Code is compiled and tested on real hardware (not unverified AI output).
* AI assistance disclosed (see "AI usage disclosure" above), per the AI guide.
* Scope kept tight; ≤ 4 open PRs; GitHub activity public.

TysonRayJones · 2026-06-09T04:03:41Z

This is a wonderful diff - I'm kicking myself for not noticing functor_projectStateVec wasn't even leveraging orderedness! 🎉

Can you please share the mentioned driver/scripts for benchmarking? Can either whack it into a comment here, or include it into the diff (which we can delete later - changes will be squashed so it won't pollute your work).

TysonRayJones · 2026-06-09T04:10:19Z

Note to self

The template parameters of the below functions and functors are now redundant:

* `functor_projectStateVec`
* `functor_projectDensMatr`
* `thrust_statevec_multiQubitProjector_sub`
* `thrust_densmatr_multiQubitProjector_sub`
* `gpu_statevec_multiQubitProjector_sub`
* `gpu_densmatr_multiQubitProjector_sub`

They can all be removed, along with the parameter dispatch in accel_statevec_multiQubitProjector_sub and accel_densmatr_multiQubitProjector_sub. I can do this myself in a cleanup commit (unless @thedaemon-wizard wishes to do it!)

…review)

After the bitmask reformulation the projector no longer specialises on the target count, so its numTargs template is dead code:

- drop the template from functor_projectStateVec/functor_projectDensMatr, thrust_{statevec,densmatr}_multiQubitProjector_sub and gpu_{statevec,densmatr}_multiQubitProjector_sub, and their INSTANTIATE_FUNC_OPTIMISED_FOR_NUM_TARGS instantiations;
- apply the same bitmask reformulation to the CPU projector (cpu_{statevec,densmatr}_multiQubitProjector_sub) so its template goes too;
- simplify accel_{statevec,densmatr}_multiQubitProjector_sub to a plain isGpuAccelerated ? gpu_ : cpu_ branch (no GET_CPU_OR_GPU_FUNC dispatch).

Shared dispatch macros and the calcProb* template chain are untouched. Unit tests pass on CPU/CPU+OMP/GPU/GPU+OMP (57146 assertions). Adds a throw-away benchmarks/ driver (not wired into CMake/CI); safe to squash/drop on merge.

thedaemon-wizard · 2026-06-09T07:28:03Z

> This is a wonderful diff - I'm kicking myself for not noticing functor_projectStateVec wasn't even leveraging orderedness! 🎉
>
> Can you please share the mentioned driver/scripts for benchmarking? Can either whack it into a comment here, or include it into the diff (which we can delete later - changes will be squashed so it won't pollute your work).

Thanks @TysonRayJones! Glad it's useful. 🎉

I've added the driver to the PR under benchmarks/benchmark_749.cpp (with a
short benchmarks/README.md). It's deliberately not wired into CMake/CI — happy
for it to be deleted in the squash, as you suggested. It builds straight against
QuEST via the built-in USER_SOURCE_NAMES mechanism:

cmake -S . -B build_bench \
    -D QUEST_ENABLE_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES=120 \
    -D CMAKE_BUILD_TYPE=Release \
    -D USER_SOURCE_NAMES=benchmarks/benchmark_749.cpp \
    -D USER_OUTPUT_EXE_NAME=bench_749
cmake --build build_bench --target bench_749 -j
./build_bench/bench_749 4 20 3 2000        # [minQ maxQ numTargs reps]

It forces the single-GPU path (useGpuAccel=1, distribution/threads off) and
syncQuESTEnv()s around each timed region so it measures completed GPU work. Build
it once against clean origin/devel and once against this branch for before/after.

On my machine (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120):

CUDA runtime API counts (N = 8…12, 200 reps, both ops, via nsys):

| CUDA runtime call | baseline | optimised | change |
| --- | ---: | ---: | --- |
| `cudaMalloc` | 3020 | 1010 | −67% |
| `cudaFree` | 3021 | 1011 | −67% |
| `cudaMemcpyAsync` | 3015 | 1005 | −67% |
| `cudaLaunchKernel` | 2025 | 2025 | unchanged |

Per-call wall time (µs, numTargs = 3, 2000 reps):

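| N | projector base (µs) | projector opt (µs) | projector speedup | prob base (µs) | prob opt (µs) | prob speedup |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 4 | 12.41 | 6.48 | 1.92× | 20.35 | 14.59 | 1.39× |
| 8 | 11.74 | 6.40 | 1.83× | 19.44 | 14.18 | 1.37× |
| 12 | 11.85 | 6.52 | 1.82× | 19.79 | 14.44 | 1.37× |
| 16 | 12.22 | 6.84 | 1.79× | 28.63 | 23.21 | 1.23× |
| 20 | 58.41 | 13.74 | 4.25× | 67.99 | 61.99 | 1.10× |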

(The residual ~1000 cudaMalloc in the optimised build is thrust::reduce's own
internal temporary inside calcProb — inherent to Thrust, out of scope here.)

thedaemon-wizard · 2026-06-09T07:29:17Z

> Note to self
>
> The template parameters of the below functions and functors are now redundant:
>
> * `functor_projectStateVec`
> * `functor_projectDensMatr`
> * `thrust_statevec_multiQubitProjector_sub`
> * `thrust_densmatr_multiQubitProjector_sub`
> * `gpu_statevec_multiQubitProjector_sub`
> * `gpu_densmatr_multiQubitProjector_sub`
>
> They can all be removed, along with the parameter dispatch in accel_statevec_multiQubitProjector_sub and accel_densmatr_multiQubitProjector_sub. I can do this myself in a cleanup commit (unless @thedaemon-wizard wishes to do it!)

Done — I went ahead and removed them (pushed in a follow-up commit). Summary:

* Dropped the now-dead template parameter from functor_projectStateVec,
  functor_projectDensMatr, thrust_{statevec,densmatr}_multiQubitProjector_sub,
  and gpu_{statevec,densmatr}_multiQubitProjector_sub, and removed their
  INSTANTIATE_FUNC_OPTIMISED_FOR_NUM_TARGS instantiations.
* Simplified accel_{statevec,densmatr}_multiQubitProjector_sub to a plain
  qureg.isGpuAccelerated ? gpu_… : cpu_… branch (matching the style already used
  elsewhere in accelerator.cpp), so the GET_CPU_OR_GPU_FUNC_OPTIMISED_FOR_ONE_PARAM
  dispatch is gone for the projector. I left the shared dispatch macros untouched
  since packAmpsIntoBuffer, partialTrace_sub and the calcProb* family still
  rely on them.

One thing to confirm: your note listed the GPU-side functions, but the
GET_CPU_OR_GPU_FUNC_… dispatch also fans out to the CPU projector, which was
still using its template param (SET_VAR_AT_COMPILE_TIME to unroll
getValueOfBits). To remove the dispatch cleanly I applied the same bitmask
reformulation to the CPU projector too:
getValueOfBits(n, qubits) == retainValue ≡ (n & qubitMask) == valueMask
(and the density-matrix (v1==v2) && (retainValue==v1) becomes
(r & qubitMask)==valueMask && (c & qubitMask)==valueMask), with
qubitMask = util_getBitMask(qubits) and valueMask = util_getBitMask(qubits, outcomes).
That makes the CPU template redundant as well, so it could be removed symmetrically.
If you'd rather keep the CPU path templated/unrolled, say the word and I'll instead
branch only the GPU side in accel_* and leave the CPU dispatch as-is.
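
The density-matrix form of the equivalence can be checked the same way on a small example. This is a host-only sketch under assumptions: `gatherBits`, `keepByGather`, `keepByMask` and `densMatrTestsAgree` are hypothetical names, `retainValue` is assumed to pack `outcomes[j]` into bit `j`, and the flat index is assumed to hold the row in its low bits and the column in its high bits.

```cpp
#include <cstddef>
#include <vector>

using qindex = long long;  // assumption: stands in for QuEST's index type

// Gather the bits of n at `qubits` into a packed value (bit j <- bit qubits[j]).
qindex gatherBits(qindex n, const std::vector<int>& qubits) {
    qindex value = 0;
    for (std::size_t j = 0; j < qubits.size(); ++j)
        value |= ((n >> qubits[j]) & 1LL) << j;
    return value;
}

// Templated-era test: row and column substates agree and equal the retained value.
bool keepByGather(qindex r, qindex c, const std::vector<int>& qubits, qindex retainValue) {
    qindex v1 = gatherBits(r, qubits);
    qindex v2 = gatherBits(c, qubits);
    return (v1 == v2) && (retainValue == v1);
}

// Mask form: the same two masks applied to the row and to the column.
bool keepByMask(qindex r, qindex c, qindex qubitMask, qindex valueMask) {
    return ((r & qubitMask) == valueMask) && ((c & qubitMask) == valueMask);
}

// Compare both tests on every element of a numQubits density matrix.
bool densMatrTestsAgree(int numQubits, const std::vector<int>& qubits,
                        const std::vector<int>& outcomes) {
    qindex retainValue = 0, qubitMask = 0, valueMask = 0;
    for (std::size_t j = 0; j < qubits.size(); ++j) {
        retainValue |= static_cast<qindex>(outcomes[j]) << j;
        qubitMask |= 1LL << qubits[j];
        valueMask |= static_cast<qindex>(outcomes[j]) << qubits[j];
    }

    const qindex dim = 1LL << numQubits;
    for (qindex n = 0; n < dim * dim; ++n) {
        qindex r = n & (dim - 1);   // assumed layout: row in the low bits
        qindex c = n >> numQubits;  // assumed layout: column in the high bits
        if (keepByGather(r, c, qubits, retainValue) != keepByMask(r, c, qubitMask, valueMask))
            return false;
    }
    return true;
}
```

The agreement does not depend on the assumed row/column layout, since both tests treat `r` and `c` symmetrically.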

A couple of notes for the record:

* The CPU reformulation drops the per-amp inner loop over targets in favour of two
  mask compares, so it shouldn't regress (and removes the getValueOfBits unroll
  entirely); happy to micro-benchmark the CPU side if useful.
* Removing the seven <0>…<5>,<-1> instantiations per projector function also
  trims a little compile time / object-code, with no runtime cost since the
  projector no longer benefits from compile-time numTargs unrolling.

Verification (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120, Release): rebuilt
clean, and the affected unit tests pass across all four deployments —
tests "*QubitProjector*,*calcProbOfMultiQubitOutcome*,*calcProbOfQubitOutcome*"
→ All tests passed (57146 assertions in 8 test cases) (CPU / CPU+OpenMP / GPU /
GPU+OpenMP).

Also confirming: I'm fine with the benchmarks/ driver being deleted in the squash —
just let me know if you'd prefer I drop it from the branch now instead.

Update code to Optimise-away small GPU allocations for projector & mu…

1bdb7a8

…lti-qubit prob

thedaemon-wizard changed the title ~~Update code to Optimise-away small GPU allocations for projector & mu…~~ Update code to Optimise-away small GPU allocations for projector & mu… unitaryHACK26 Jun 11, 2026

thedaemon-wizard wants to merge 2 commits into QuEST-Kit:devel from thedaemon-wizard:optimise-small-gpu-allocations-749
