Skip to content

Improve applyCompMatr summation accuracy#781

Open
LubuSeb wants to merge 1 commit into
QuEST-Kit:develfrom
LubuSeb:codex/unitaryhack-quest-598-compensated
Open

Improve applyCompMatr summation accuracy#781
LubuSeb wants to merge 1 commit into
QuEST-Kit:develfrom
LubuSeb:codex/unitaryhack-quest-598-compensated

Conversation

@LubuSeb

@LubuSeb LubuSeb commented Jun 5, 2026

Copy link
Copy Markdown

Closes #598.

Summary

  • Replaces the naive inner accumulation in cpu_statevec_anyCtrlAnyTargDenseMatr_sub() with component-wise compensated summation over cpu_qcomp.re and cpu_qcomp.im.
  • Writes each output amplitude once after its local reduction rather than repeatedly updating amps[i] inside the summation loop.
  • Adds a deterministic complex cancellation regression for 4-target applyCompMatr, which exercises the 3+ target CompMatr path rather than the one/two-target specialisations.

Notes

I saw the earlier closed attempt in #777. This version keeps the same narrow two-file scope, but uses direct qreal compensation for the real and imaginary components instead of relying on complex add/sub overloads in the hot loop. I also checked base_qcomp: the current operators are ordinary component-wise arithmetic, so independent real/imaginary Kahan compensation is compatible with the backend representation.

Local measurements

Configuration: Windows, GCC 13.2.0, Release, single CPU (QUEST_ENABLE_OMP=OFF, QUEST_ENABLE_MPI=OFF, QUEST_ENABLE_CUDA=OFF, QUEST_ENABLE_HIP=OFF). The benchmark applies a dense CompMatr whose first output row is [large+i*large, 1-i, ..., 1-i, -large-i*large] to an all-ones state. The expected first amplitude is (2^targets - 2) - i(2^targets - 2).

precision targets baseline abs error patched abs error baseline avg ms patched avg ms
1 4 14 0 0.0114 0.0117
1 8 254 0 0.0779 0.2100
1 10 1022 0 1.0715 3.0297
2 4 14 0 0.0113 0.0121
2 8 254 0 0.0801 0.2008
2 10 1022 0 1.4134 3.2629
4 4 14 0 0.0137 0.0134
4 8 254 0 0.4391 0.4150
4 10 1022 0 7.1805 6.6778

The measurements show the expected accuracy improvement. The overhead is visible for larger single/double precision reductions, which seems consistent with the tradeoff described in the issue.

Testing

cmake -S . -B build-598-fp2 -G Ninja -D CMAKE_BUILD_TYPE=Release -D QUEST_BUILD_TESTS=ON -D QUEST_FLOAT_PRECISION=2 -D QUEST_ENABLE_OMP=OFF -D QUEST_ENABLE_MPI=OFF -D QUEST_ENABLE_CUDA=OFF -D QUEST_ENABLE_HIP=OFF
cmake --build build-598-fp2 --parallel
build-598-fp2/tests/tests.exe '*applyCompMatr*' --reporter compact
ctest --test-dir build-598-fp2 --output-on-failure -j 4
git diff --check

Results:

  • *applyCompMatr*: passed, 10 test cases / 10,003 assertions.
  • Full CPU-only double-precision ctest: passed.
  • git diff --check: passed; Git emitted only Windows line-ending conversion warnings.

Prepared with AI assistance; I reviewed the patch and ran the listed local checks.

@TysonRayJones

Copy link
Copy Markdown
Member

How come you separate out the real and imaginary components, mr robo?

@LubuSeb

LubuSeb commented Jun 7, 2026

Copy link
Copy Markdown
Author

Good question. I split them because the dense matrix-vector dot product ultimately has two scalar reductions: one for the real component and one for the imaginary component. Kahan compensation is a scalar summation algorithm, so the complex version here is one compensation stream for sum.re / correction.re and one for sum.im / correction.im.

I also wanted to keep the compensation step explicit instead of relying on cpu_qcomp's overloaded + / - operators in that hot loop. The complex multiply is still the existing elem * cache[j]; the split only compensates the accumulation of those already-computed complex products. So this improves cancellation-heavy dense matrix applications, which the regression test targets, but it is not trying to make each complex product higher precision.

I can factor this into a small local helper if you prefer that shape.

@TysonRayJones

Copy link
Copy Markdown
Member

Do you see a change when using the cpu_qcomp overloads? I think the existing bracketing, or breaking up line 631 into two-steps would counteract any issue.

Are you able to please share your benchmarking / accuracy testing code? I note your benchmarks do not indicate the size of the Qureg being processed, which is critical. We expect to occlude these additional arithmetic costs when caching costs are large, because the processed state is large.

@LubuSeb

LubuSeb commented Jun 9, 2026

Copy link
Copy Markdown
Author

Thanks, I rechecked both points and reran the harness with the Qureg size reported separately from the number of targets.

On the overload/bracketing question: I made a temporary comparison build where the old update was split into:

cpu_qcomp term = elem * cache[j];
amps[i] += term;

That still behaves like baseline on the cancellation case. For example, with QUEST_FLOAT_PRECISION=2, targets=8, the baseline and two-step builds both report abs_error=254, while the compensated build reports abs_error=0. The same pattern holds for fp1/fp2/fp4 at targets 4/8/10/12.

So I think the issue is not the multiply temporary or the overloaded += expression shape. The constructed case loses the small real terms in the serial accumulation order large + small... - large. The patch still uses the existing elem * cache[j]; it only compensates the accumulation of those already-computed terms. cpu_qcomp is base_qcomp, and its += is component-wise re += a.re; im += a.im;, so explicit scalar compensation seemed clearest. If you meant splitting the Kahan correction assignment itself into a named temporary, I can do that for readability; under the Release flags below I would expect it to be equivalent.

I also fixed the benchmark harness to take numTargets, numQubits, and repeats, then print state_amps and target_amps. You were right that the previous numbers were all-target, one-block cases and were missing that context.

Local verification:

ctest --output-on-failure -R "apply.*CompMatr"
100% tests passed, 0 tests failed out of 19

Benchmark environment: Windows, GCC 13.2, Release (-O3 -DNDEBUG), OMP/MPI/GPU off.

Accuracy summary:

targets=4,  qubits=4,  state_amps=16,   target_amps=16:   baseline/twostep abs_error=14,   patched abs_error=0
targets=8,  qubits=8,  state_amps=256,  target_amps=256:  baseline/twostep abs_error=254,  patched abs_error=0
targets=10, qubits=10, state_amps=1024, target_amps=1024: baseline/twostep abs_error=1022, patched abs_error=0
targets=12, qubits=12, state_amps=4096, target_amps=4096: baseline/twostep abs_error=4094, patched abs_error=0

That was true for QUEST_FLOAT_PRECISION=1, 2, and 4.

Some larger-state fp2 timing results from the same harness:

targets=4,  qubits=16, state_amps=65536,   target_amps=16:   baseline 1.15 ms, twostep 1.43 ms, patched 2.82 ms
targets=4,  qubits=20, state_amps=1048576, target_amps=16:   baseline 21.90 ms, twostep 24.12 ms, patched 42.03 ms
targets=8,  qubits=16, state_amps=65536,   target_amps=256:  baseline 25.13 ms, twostep 19.02 ms, patched 51.74 ms
targets=8,  qubits=18, state_amps=262144,  target_amps=256:  baseline 85.03 ms, twostep 79.20 ms, patched 213.49 ms
targets=10, qubits=16, state_amps=65536,   target_amps=1024: baseline 168.07 ms, twostep 170.60 ms, patched 253.93 ms

So I would not claim this is cost-free in the CPU/no-OMP configuration I tested. The accuracy improvement is reproducible, but the performance tradeoff is real in this synthetic cancellation-heavy path.

The standalone harness I used:

bench598.cpp
#include "quest.h"

#include <chrono>
#include <cmath>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <limits>
#include <vector>

int main(int argc, char** argv) {
    int numTargets = (argc > 1) ? std::atoi(argv[1]) : 4;
    int numQubits = (argc > 2) ? std::atoi(argv[2]) : numTargets;
    int repeats = (argc > 3) ? std::atoi(argv[3]) : 20;

    if (numQubits < numTargets) {
        std::cerr << "numQubits must be >= numTargets\n";
        return 1;
    }

    initQuESTEnv();
    setQuESTValidationEpsilon(0);

    std::vector<int> targets(numTargets);
    for (int i=0; i<numTargets; i++)
        targets[i] = i;

    qindex numTargAmps = qindex(1) << numTargets;
    qindex numStateAmps = qindex(1) << numQubits;
    std::vector<qcomp> amps(numStateAmps, qcomp(1, 0));

    CompMatr matrix = createCompMatr(numTargets);
    for (qindex i=0; i<matrix.numRows; i++)
        for (qindex j=0; j<matrix.numRows; j++)
            matrix.cpuElems[i][j] = qcomp(0, 0);

    qreal large = std::ldexp(qreal(1), std::numeric_limits<qreal>::digits);
    matrix.cpuElems[0][0] = qcomp( large,  large);
    for (qindex j=1; j<matrix.numRows-1; j++)
        matrix.cpuElems[0][j] = qcomp(1, -1);
    matrix.cpuElems[0][matrix.numRows-1] = qcomp(-large, -large);
    syncCompMatr(matrix);

    Qureg qureg = createQureg(numQubits);
    qcomp amp = qcomp(0, 0);

    for (int r=0; r<2; r++) {
        setQuregAmps(qureg, 0, amps.data(), amps.size());
        applyCompMatr(qureg, targets.data(), numTargets, matrix);
    }

    double totalMs = 0;
    for (int r=0; r<repeats; r++) {
        setQuregAmps(qureg, 0, amps.data(), amps.size());

        auto start = std::chrono::steady_clock::now();
        applyCompMatr(qureg, targets.data(), numTargets, matrix);
        auto end = std::chrono::steady_clock::now();

        totalMs += std::chrono::duration<double, std::milli>(end - start).count();
        amp = getQuregAmp(qureg, 0);
    }

    qreal expectedRe = qreal(numTargAmps - 2);
    qreal expectedIm = -qreal(numTargAmps - 2);
    qreal absErr = std::abs(std::real(amp) - expectedRe) + std::abs(std::imag(amp) - expectedIm);

    std::cout << std::setprecision(21)
              << "precision_bytes=" << sizeof(qreal)
              << " targets=" << numTargets
              << " qubits=" << numQubits
              << " state_amps=" << numStateAmps
              << " target_amps=" << numTargAmps
              << " repeats=" << repeats
              << " avg_ms=" << (totalMs / repeats)
              << " abs_error=" << absErr
              << " amp_re=" << std::real(amp)
              << " amp_im=" << std::imag(amp)
              << "\n";

    destroyQureg(qureg);
    destroyCompMatr(matrix);
    finalizeQuESTEnv();
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants