Improve `applyCompMatr` accuracy with compensated summation

> [!NOTE]
> For all development, please work out of the [`devel`](https://github.com/QuEST-Kit/QuEST/tree/devel) branch

## Summary

Update `applyCompMatr` (specifically subroutine [`cpu_statevec_anyCtrlAnyTargDenseMatr_sub()`](https://github.com/QuEST-Kit/QuEST/blob/8678836cd29ac59688556399860aab712db1fe14/quest/src/cpu/cpu_subroutines.cpp#L483)) to use _compensated summation_ and measure the effect on (single CPU) runtime and accuracy for few and many target qubits.

## Context

The logic of many QuEST simulation functions involves linearly combining a fixed number of amplitudes:
```python
# parallel
for each pair of amplitudes:
    amp1 -> coeff1 * amp1 + coeff2 * amp2
    amp2 -> coeff1 * amp2 + coeff2 * amp1
``` 
In contrast, functions which accept matrices of any size (`applyCompMatr`, `mixSuperOp` and `mixKrausMap`; see for example [here](https://github.com/QuEST-Kit/QuEST/blob/53f8f3ad60e5b0171646ee8250c0e6ea65878b44/quest/src/cpu/cpu_subroutines.cpp#L483)) have logic resembling:
```python
batchsize = powerOf2(length(targets))

# parallel
for each batch:
    cache = copy of each amplitude in batch

    for amplitude in batch:
        amplitude = 0

        for other_amplitude in cache:
            amplitude += somecoeff * other_amplitude
```
In such functions, each output amplitude is the linear combination of a _dynamic_ number of input amplitudes; typically the matrix dimension (`2` to the power of the number of target qubits). While the outer loop is parallelised, the inner loops are serially evaluated by a single thread.
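As a concrete illustration of the pseudocode above, here is a minimal C++ sketch of the reduction performed on a single batch. It is not QuEST's actual subroutine: `amp_t` is an assumed stand-in for `qcomp`, `applyDenseToBatch` is a hypothetical helper, and the matrix is assumed to be stored row-major in a flat vector.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// stand-in for QuEST's qcomp at double precision (an assumption of this sketch)
using amp_t = std::complex<double>;

// Applies a dense (dim x dim, row-major) matrix to one batch of amplitudes,
// in place. Hypothetical helper for illustration only.
void applyDenseToBatch(std::vector<amp_t>& batch, const std::vector<amp_t>& matr) {
    const std::size_t dim = batch.size();

    // the batch is overwritten below, so the inputs must be cached first
    const std::vector<amp_t> cache = batch;

    for (std::size_t r = 0; r < dim; r++) {
        amp_t sum = 0;

        // serial, uncompensated reduction of dim terms
        for (std::size_t c = 0; c < dim; c++)
            sum += matr[r * dim + c] * cache[c];

        batch[r] = sum;
    }
}
```

The inner loop over `c` is the reduction whose accuracy this issue is concerned with.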

In this issue, we consider the numerical precision of performing
```python
        for other_amplitude in cache:
            amplitude += somecoeff * other_amplitude
```
A realistic maximum size of the input `CompMatr` is `16` qubits, containing `2^32 = 4,294,967,296 ~ 4B` elements and requiring `64 GiB` at double precision (`16` bytes per element). In this extreme regime, each output amplitude is a reduction over `2^16 = 65,536` complex products, and each batch performs billions of multiply-adds in total, over floating-point numbers which may vary enormously in their relative size. As such, the current implementation is _numerically unstable_: rounding error grows with the length of the sum, and it is liable to [catastrophic cancellation](https://en.wikipedia.org/wiki/Catastrophic_cancellation), so that the modification to the state is inaccurate for many target qubits. This is likely to break state normalisation and corrupt other state properties.

To remedy this, we could replace that naive sum in [`cpu_statevec_anyCtrlAnyTargDenseMatr_sub()`](https://github.com/QuEST-Kit/QuEST/blob/53f8f3ad60e5b0171646ee8250c0e6ea65878b44/quest/src/cpu/cpu_subroutines.cpp#L483) with [_compensated summation_](https://www.sciencedirect.com/science/article/pii/S1571066115000481), such as [Kahan summation](https://en.wikipedia.org/wiki/Kahan_summation_algorithm), to improve the numerical accuracy of the final amplitudes. An example of Kahan summation of `qcomp` (QuEST's complex primitive type) already exists in the unit test utilities [here](https://github.com/QuEST-Kit/QuEST/blob/53f8f3ad60e5b0171646ee8250c0e6ea65878b44/tests/utils/linalg.cpp#L115).
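For reference, a minimal sketch of what the compensated reduction might look like is given below, next to the naive version. It assumes `std::complex<double>` as a stand-in for `qcomp`; the names `sumNaive` and `sumKahan` are hypothetical and do not exist in QuEST.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// stand-in for QuEST's qcomp at double precision (an assumption of this sketch)
using amp_t = std::complex<double>;

// Uncompensated reduction, mirroring the inner loop of the pseudocode.
amp_t sumNaive(const std::vector<amp_t>& coeffs, const std::vector<amp_t>& amps) {
    amp_t sum = 0;
    for (std::size_t i = 0; i < amps.size(); i++)
        sum += coeffs[i] * amps[i];
    return sum;
}

// Kahan-compensated reduction. Complex addition is component-wise, so the
// real and imaginary parts are each compensated independently.
amp_t sumKahan(const std::vector<amp_t>& coeffs, const std::vector<amp_t>& amps) {
    amp_t sum = 0;
    amp_t comp = 0; // running estimate of the rounding error lost so far

    for (std::size_t i = 0; i < amps.size(); i++) {
        amp_t term = coeffs[i] * amps[i] - comp; // re-inject previously lost bits
        amp_t next = sum + term;                 // low-order bits of term may be lost here
        comp = (next - sum) - term;              // recover what was just lost
        sum = next;
    }
    return sum;
}
```

For instance, summing `1` followed by a thousand terms of size `1e-16` leaves the naive result at exactly `1`, while the compensated result retains the accumulated `1e-13`. Note that aggressive floating-point flags (such as `-ffast-math`) permit the compiler to simplify the compensation term away, so the build flags matter for any measurement.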

However, switching to compensated summation might _not_ be worthwhile. Kahan summation involves more total operations than naive summation, slowing things down and potentially jeopardising optimisations like auto-vectorisation. This might be evident for small `Qureg` _before_ simulation becomes [memory bandwidth bound](https://en.wikipedia.org/wiki/Memory-bound_function). The benefit to accuracy might also be negligible until the number of target qubits is large, gratuitously slowing few-target simulation. It might be better to use compensation only above some threshold, or perhaps the benefit is never worth the increased code complexity.
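One way the threshold idea could be structured is sketched below: a compile-time flag selects the loop body, so the few-target path keeps the plain loop, and a runtime check on the number of targets picks between the two. This is only an illustration under assumptions: `amp_t` stands in for `qcomp`, the names `dotRow`, `dotRowAuto` and `COMPENSATION_THRESHOLD` are hypothetical, and the threshold value is an arbitrary placeholder that the proposed measurements would need to determine.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// stand-in for QuEST's qcomp at double precision (an assumption of this sketch)
using amp_t = std::complex<double>;

// Reduction over one matrix row. Compensate is a compile-time constant, so
// each instantiation contains only one of the two loop bodies after optimisation.
template <bool Compensate>
amp_t dotRow(const amp_t* row, const amp_t* amps, std::size_t dim) {
    amp_t sum = 0;
    amp_t comp = 0; // only used by the compensated instantiation

    for (std::size_t i = 0; i < dim; i++) {
        if (Compensate) {
            amp_t term = row[i] * amps[i] - comp;
            amp_t next = sum + term;
            comp = (next - sum) - term;
            sum = next;
        } else {
            sum += row[i] * amps[i];
        }
    }
    return sum;
}

// Placeholder threshold (hypothetical); to be replaced by a measured value.
constexpr int COMPENSATION_THRESHOLD = 8;

// Chooses the compensated path only when the sum is long enough to matter.
amp_t dotRowAuto(const amp_t* row, const amp_t* amps, int numTargets) {
    const std::size_t dim = std::size_t(1) << numTargets;
    return (numTargets >= COMPENSATION_THRESHOLD)
        ? dotRow<true>(row, amps, dim)
        : dotRow<false>(row, amps, dim);
}
```

Whether such a split is worth the extra code path is exactly what the measurements proposed in this issue should decide.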


## Details

Modify [`cpu_statevec_anyCtrlAnyTargDenseMatr_sub()`](https://github.com/QuEST-Kit/QuEST/blob/8678836cd29ac59688556399860aab712db1fe14/quest/src/cpu/cpu_subroutines.cpp#L483) to use Kahan summation and measure the effect on runtime and accuracy (compared to the current implementation _without compensation_) at different system sizes. Visualise the results to show the costs and benefits of Kahan summation, elucidating when (if ever) it is worthwhile. Repeat this for the three possible `qcomp` precisions (see [here](https://github.com/QuEST-Kit/QuEST/blob/main/docs/compile.md#compile_precision)).

> It is sufficient to test only in single-CPU settings.
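Before touching QuEST itself, the trade-off can be explored in isolation with a small standalone harness like the sketch below. It times a single row reduction (naive or Kahan) over seeded random data and reports the absolute error against a `long double` reference. Everything here is an assumption for illustration: `amp_t` stands in for `qcomp`, `Measurement` and `measureReduction` are hypothetical names, and `long double` is assumed to be wider than `double` (which is not true on every platform).

```cpp
#include <cassert>
#include <chrono>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

using amp_t = std::complex<double>;      // stand-in for qcomp (assumption)
using ref_t = std::complex<long double>; // higher-precision reference type

struct Measurement {
    double seconds;  // wall-clock time of the reduction
    double absError; // |result - reference|
};

// Times one reduction of 2^numTargets terms and compares it to a reference.
Measurement measureReduction(int numTargets, bool compensate) {
    const std::size_t dim = std::size_t(1) << numTargets;

    // fixed seed, so repeated executions see identical data
    std::mt19937_64 gen(123456789u);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<amp_t> coeffs(dim), amps(dim);
    for (std::size_t i = 0; i < dim; i++) {
        double a = dist(gen), b = dist(gen), c = dist(gen), d = dist(gen);
        coeffs[i] = amp_t(a, b);
        amps[i] = amp_t(c, d);
    }

    // reference sum, with products and additions both in long double
    ref_t ref = 0;
    for (std::size_t i = 0; i < dim; i++)
        ref += ref_t(coeffs[i]) * ref_t(amps[i]);

    amp_t sum = 0, comp = 0;
    auto start = std::chrono::steady_clock::now();
    if (compensate) {
        for (std::size_t i = 0; i < dim; i++) {
            amp_t term = coeffs[i] * amps[i] - comp;
            amp_t next = sum + term;
            comp = (next - sum) - term;
            sum = next;
        }
    } else {
        for (std::size_t i = 0; i < dim; i++)
            sum += coeffs[i] * amps[i];
    }
    auto stop = std::chrono::steady_clock::now();

    Measurement out;
    out.seconds = std::chrono::duration<double>(stop - start).count();
    out.absError = static_cast<double>(std::abs(ref_t(sum) - ref));
    return out;
}
```

A single timed loop is noisy, so real measurements should repeat each configuration and ultimately time `applyCompMatr` as a whole. Also note that the products are still rounded in working precision before being summed, so the measured error includes a contribution that compensation cannot remove.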


## Testing

A quick and dirty way to see the effect on the accuracy is to apply a large `CompMatr` upon a fixed state (with and without compensation) and compare a resulting amplitude. For example:
```C++
#include "quest.h"

int main() {
    initQuESTEnv();

    // choose the matrix size (up to 21 targets), with arbitrary target qubits;
    // only the first numTargets entries of targets[] are used
    int numTargets = 13;
    int targets[] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20};

    // create a matrix (elements are unimportant; we'll disable unitarity validation)
    CompMatr matrix = createCompMatr(numTargets);
    for (qindex i=0; i<matrix.numRows; i++)
        for (qindex j=0; j<matrix.numRows; j++)
            matrix.cpuElems[i][j] = 1;
    syncCompMatr(matrix);

    // create a random Qureg (fixed between executions) which need not be
    // any bigger than the operator matrix in order to test accuracy
    unsigned seed = 123456789u;
    setSeeds(&seed, 1);
    Qureg qureg = createQureg(numTargets);
    initRandomPureState(qureg);

    // apply the matrix and obtain the effect on the first amplitude
    setValidationEpsilon(0);
    applyCompMatr(qureg, targets, numTargets, matrix);
    qcomp amp = getQuregAmp(qureg, 0);

    // report the amplitude at a large precision
    setMaxNumReportedSigFigs(20);
    reportScalar("amp", amp);

    destroyCompMatr(matrix);
    destroyQureg(qureg);
    finalizeQuESTEnv();
    return 0;
}
```

