Improve applyCompMatr summation accuracy#781
Conversation
|
How come you separate out the real and imaginary components, mr robo? |
|
Good question. I split them because the dense matrix-vector dot product ultimately has two scalar reductions: one for the real component and one for the imaginary component. Kahan compensation is a scalar summation algorithm, so the complex version here is one compensation stream for I also wanted to keep the compensation step explicit instead of relying on I can factor this into a small local helper if you prefer that shape. |
|
Do you see a change when using the Are you able to please share your benchmarking / accuracy testing code? I note your benchmarks do not indicate the size of the |
|
Thanks, I rechecked both points and reran the harness with the Qureg size reported separately from the number of targets. On the overload/bracketing question: I made a temporary comparison build where the old update was split into: cpu_qcomp term = elem * cache[j];
amps[i] += term;That still behaves like baseline on the cancellation case. For example, with So I think the issue is not the multiply temporary or the overloaded I also fixed the benchmark harness to take Local verification: Benchmark environment: Windows, GCC 13.2, Release ( Accuracy summary: That was true for Some larger-state fp2 timing results from the same harness: So I would not claim this is cost-free in the CPU/no-OMP configuration I tested. The accuracy improvement is reproducible, but the performance tradeoff is real in this synthetic cancellation-heavy path. The standalone harness I used: bench598.cpp#include "quest.h"
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <limits>
#include <vector>
int main(int argc, char** argv) {
int numTargets = (argc > 1) ? std::atoi(argv[1]) : 4;
int numQubits = (argc > 2) ? std::atoi(argv[2]) : numTargets;
int repeats = (argc > 3) ? std::atoi(argv[3]) : 20;
if (numQubits < numTargets) {
std::cerr << "numQubits must be >= numTargets\n";
return 1;
}
initQuESTEnv();
setQuESTValidationEpsilon(0);
std::vector<int> targets(numTargets);
for (int i=0; i<numTargets; i++)
targets[i] = i;
qindex numTargAmps = qindex(1) << numTargets;
qindex numStateAmps = qindex(1) << numQubits;
std::vector<qcomp> amps(numStateAmps, qcomp(1, 0));
CompMatr matrix = createCompMatr(numTargets);
for (qindex i=0; i<matrix.numRows; i++)
for (qindex j=0; j<matrix.numRows; j++)
matrix.cpuElems[i][j] = qcomp(0, 0);
qreal large = std::ldexp(qreal(1), std::numeric_limits<qreal>::digits);
matrix.cpuElems[0][0] = qcomp( large, large);
for (qindex j=1; j<matrix.numRows-1; j++)
matrix.cpuElems[0][j] = qcomp(1, -1);
matrix.cpuElems[0][matrix.numRows-1] = qcomp(-large, -large);
syncCompMatr(matrix);
Qureg qureg = createQureg(numQubits);
qcomp amp = qcomp(0, 0);
for (int r=0; r<2; r++) {
setQuregAmps(qureg, 0, amps.data(), amps.size());
applyCompMatr(qureg, targets.data(), numTargets, matrix);
}
double totalMs = 0;
for (int r=0; r<repeats; r++) {
setQuregAmps(qureg, 0, amps.data(), amps.size());
auto start = std::chrono::steady_clock::now();
applyCompMatr(qureg, targets.data(), numTargets, matrix);
auto end = std::chrono::steady_clock::now();
totalMs += std::chrono::duration<double, std::milli>(end - start).count();
amp = getQuregAmp(qureg, 0);
}
qreal expectedRe = qreal(numTargAmps - 2);
qreal expectedIm = -qreal(numTargAmps - 2);
qreal absErr = std::abs(std::real(amp) - expectedRe) + std::abs(std::imag(amp) - expectedIm);
std::cout << std::setprecision(21)
<< "precision_bytes=" << sizeof(qreal)
<< " targets=" << numTargets
<< " qubits=" << numQubits
<< " state_amps=" << numStateAmps
<< " target_amps=" << numTargAmps
<< " repeats=" << repeats
<< " avg_ms=" << (totalMs / repeats)
<< " abs_error=" << absErr
<< " amp_re=" << std::real(amp)
<< " amp_im=" << std::imag(amp)
<< "\n";
destroyQureg(qureg);
destroyCompMatr(matrix);
finalizeQuESTEnv();
} |
Closes #598.
Summary
cpu_statevec_anyCtrlAnyTargDenseMatr_sub()with component-wise compensated summation overcpu_qcomp.reandcpu_qcomp.im.amps[i]inside the summation loop.applyCompMatr, which exercises the 3+ targetCompMatrpath rather than the one/two-target specialisations.Notes
I saw the earlier closed attempt in #777. This version keeps the same narrow two-file scope, but uses direct
qrealcompensation for the real and imaginary components instead of relying on complex add/sub overloads in the hot loop. I also checkedbase_qcomp: the current operators are ordinary component-wise arithmetic, so independent real/imaginary Kahan compensation is compatible with the backend representation.Local measurements
Configuration: Windows, GCC 13.2.0, Release, single CPU (
QUEST_ENABLE_OMP=OFF,QUEST_ENABLE_MPI=OFF,QUEST_ENABLE_CUDA=OFF,QUEST_ENABLE_HIP=OFF). The benchmark applies a denseCompMatrwhose first output row is[large+i*large, 1-i, ..., 1-i, -large-i*large]to an all-ones state. The expected first amplitude is(2^targets - 2) - i(2^targets - 2).The measurements show the expected accuracy improvement. The overhead is visible for larger single/double precision reductions, which seems consistent with the tradeoff described in the issue.
Testing
Results:
*applyCompMatr*: passed, 10 test cases / 10,003 assertions.ctest: passed.git diff --check: passed; Git emitted only Windows line-ending conversion warnings.Prepared with AI assistance; I reviewed the patch and ran the listed local checks.