fix(sum): use pairwise summation for floating-point linalg::Sum by IvanaGyro · Pull Request #849 · Cytnx-dev/Cytnx

IvanaGyro · 2026-05-29T04:08:06Z

Fixes #835.

Problem

linalg::Sum accumulated every element into one scalar in a serial loop, whose worst-case rounding error grows as O(N · eps). The Accuracy test in tests/linalg_test/sum_test.cpp (10000 identical floating-point values whose bit patterns are chosen to expose round-off) failed under naive accumulation and was therefore disabled.

Change

Add a strided pairwise-summation primitive src/backend/linalg_internal_cpu/pairwise_sum.hpp:
- StrideView — a C++20 random-access, sized view over every k-th element of a range.
- PairwiseSum — NumPy np.add.reduce-style summation: straight loop below 8 elements, an 8-accumulator unrolled block up to 128, and a halve-and-recurse split (rounded to a multiple of 8) above that. Worst-case error growth drops to O(log N · eps) at essentially the same cost.
Rewrite the Float/Double/ComplexFloat/ComplexDouble branches of Sum_internal to use PairwiseSum. Integer branches keep the straight loop (they do not round).
Re-enable the accuracy test (DISABLED_Accuracy → Accuracy).
The GPU path (cuReduce_gpu) already performs a block/tree reduction with bounded error, so it is unchanged and its GpuAccuracy test stays enabled.

StrideView is intentionally reusable for strided reductions (e.g. a matrix diagonal); the follow-up Trace refactor (#834) builds on it.

Test plan

openblas-cpu: full suite 948/948 passed (includes the 10 re-enabled LinalgSumHomogeneousValuesTest.Accuracy<...> cases).
mkl-cpu: full suite 948/948 passed.
For the previously-disabled values the new code lands within 2 ULP of the reference (EXPECT_NUMBER_EQ tolerates 4 ULP); naive accumulation was 108 ULP off.
GPU CI (unchanged GPU path).

🤖 Draft — opened for review.

Generated by Claude Code

gemini-code-assist

Code Review

This pull request replaces naive summation loops in Sum_internal.cpp with a recursive pairwise summation algorithm (PairwiseSum) implemented in pairwise_sum.hpp to improve floating-point accuracy, and enables the corresponding accuracy tests. The review feedback highlights a critical safety issue in StrideView::Iterator where storing a pointer to the view can lead to a dangling pointer, and recommends storing the underlying range's iterator instead, along with updating the begin() and end() methods.

codecov · 2026-05-29T04:42:03Z

Codecov Report

❌ Patch coverage is 96.87500% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 28.70%. Comparing base (9f0e8fb) to head (3a5a680).
⚠️ Report is 192 commits behind head on master.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/backend/linalg_internal_cpu/pairwise_sum.hpp	96.42%	0 Missing and 1 partial ⚠️

❗ There is a different number of reports uploaded between BASE (9f0e8fb) and HEAD (3a5a680). Click for more details.

HEAD has 1 upload less than BASE

Flag BASE (9f0e8fb) HEAD (3a5a680)

1 0

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #849      +/-   ##
==========================================
- Coverage   36.89%   28.70%   -8.20%     
==========================================
  Files         213      242      +29     
  Lines       31032    35272    +4240     
  Branches    12738    14628    +1890     
==========================================
- Hits        11449    10124    -1325     
- Misses      17719    18161     +442     
- Partials     1864     6987    +5123

Flag	Coverage Δ
cpp	`28.31% <96.87%> (?)`
python	`51.07% <ø> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
C++ backend	`29.88% <76.19%> (∅)`
Python bindings	`16.89% <ø> (∅)`
Python package	`51.07% <ø> (∅)`

Files with missing lines	Coverage Δ
src/backend/linalg_internal_cpu/Sum_internal.cpp	`96.29% <100.00%> (+86.61%)`	⬆️
src/backend/linalg_internal_cpu/pairwise_sum.hpp	`96.42% <96.42%> (ø)`

... and 156 files with indirect coverage changes

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d2a0d31...3a5a680. Read the comment docs.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

IvanaGyro

(Review written by Claude on behalf of @IvanaGyro.)

The pairwise-summation core looks correct to me. PairwiseSumBlocks covers every element exactly once across the three regimes (<8 straight loop, the 8-accumulator block for [8,128] with its scalar tail, and the n>128 split where half is rounded down to a multiple of 8 and is always ≥64), and the reduction order matches NumPy's np.add.reduce. Empty ranges return T(0), preserving the old behavior. The Sum_internal rewrite and the test re-enable are clean.

One additive note on top of the existing comments — see the inline comment about StrideView being unexercised in this PR.

IvanaGyro · 2026-05-29T12:03:26Z

(Comment written by Claude on behalf of @IvanaGyro.)

Test-coverage note (in addition to the StrideView point in the review): the re-enabled Accuracy test sums 10000 identical values. That exercises the recursive split, but not the block-size boundaries nor mixed-magnitude inputs — which is exactly where pairwise summation diverges from naive accumulation. Consider a small direct PairwiseSum unit test covering n = 0, 1, 7, 8, 9, 128, 129 with a mix of large/small/negative values, compared against a higher-precision (e.g. long double / Kahan) reference, so the 8-accumulator block edge and the halve-and-recurse boundary are pinned and can't silently regress.

IvanaGyro

(Review written by Claude on behalf of @IvanaGyro.)

Thanks for addressing the earlier round — stride_view/iterator naming, the dangling-pointer fix (iterator now stores the base iterator, with a dedicated IteratorsOutliveTheView test), the move to its own stride_view.hpp, the | stride(k) pipe adaptor, and the static_asserts on the range concepts all look good, and the fix is correct. One blocking issue: the new test file isn't registered in the build — details inline.

linalg::Sum accumulated every element into one scalar in a serial loop, whose worst-case rounding error grows as O(N * eps). The Accuracy test in tests/linalg_test/sum_test.cpp (10000 identical floating-point values whose bit patterns are chosen to expose round-off) failed under naive accumulation and was disabled. Add a pairwise-summation primitive (pairwise_sum.hpp): PairwiseSum mirrors NumPy's np.add.reduce -- a straight loop below 8 elements, an eight-accumulator unrolled block up to 128, and a halve-and-recurse split rounded to a multiple of eight above that. Worst-case error growth drops to O(log N * eps) at essentially the same cost. PairwiseSum consumes any random-access range, so it also serves strided reductions later. Rewrite the Float/Double/ComplexFloat/ComplexDouble branches of Sum_internal to call PairwiseSum over the contiguous storage span and re-enable the accuracy test. Integer branches keep the straight loop since they do not round. The GPU path (cuReduce_gpu) already performs a block/tree reduction with bounded error, so it is left unchanged. For the disabled test's values the new code lands within 2 ULP of the reference, against the 4-ULP tolerance of EXPECT_NUMBER_EQ; naive accumulation was 108 ULP off. Co-authored-by: Claude <noreply@anthropic.com>

pcchen · 2026-05-31T02:44:42Z

PR #849 Review: `fix(sum)` — pairwise summation for floating-point `linalg::Sum`

Posted by Claude Code on behalf of @pcchen

Overview

This PR replaces the naive serial accumulation loop in Sum_internal (float/double/complexfloat/complexdouble) with a pairwise summation algorithm, reducing worst-case rounding error from O(N·eps) to O(log N·eps). The re-enabled Accuracy test confirms the fix works in practice (within 2 ULP vs. 108 ULP for naive).

The algorithm mirrors NumPy's np.add.reduce:

n < 8: straight loop
8 ≤ n ≤ 128: 8-accumulator unrolled block + tail
n > 128: split at n/2 rounded down to a multiple of 8, recurse

Correctness

The algorithm is correct and the boundary cases are handled:

n = 0: falls into n < 8 path, loop doesn't execute, returns T(0). ✅
n = 8: initialises r0..r7, inner loop skipped (i+8 > n), tail skipped, returns tree sum of 8 accumulators. ✅
n = 9: inner loop skipped, tail adds first[8] to the tree sum. ✅
Complex types: T(0) constructs {0.0, 0.0} for std::complex<T>, all arithmetic works element-wise. ✅
Half-rounding can't produce zero: for n > 128, half = n/2 ≥ 64, and half - half%8 ≥ 56 > 0, so the recursion always terminates. ✅
Recursion depth: O(log₂(N/128)) — for N = 1 billion, depth ≈ 23, no stack concern. ✅

Issues

1. C++20 requirement needs explicit confirmation (Important)

pairwise_sum.hpp requires C++20:

#include <ranges> — C++20
std::ranges::random_access_range, std::ranges::range_value_t, std::ranges::begin, std::ranges::size — C++20 concepts/functions
std::random_access_iterator — C++20 concept
std::span in Sum_internal.cpp — C++20

The PR passes the full test suite, so the build system already compiles with C++20. But if CMakeLists.txt hasn't been updated to require C++20 (e.g., set(CMAKE_CXX_STANDARD 20)), a clean build on a C++17-defaulting toolchain will fail. Please confirm CMAKE_CXX_STANDARD is already set to ≥ 20, or update it in this PR.

2. Comment references non-existent `stride_view.hpp` (Minor)

// pass a strided view (see stride_view.hpp) to sum a strided sequence such as a matrix diagonal.

stride_view.hpp does not exist yet — it's deferred to the follow-up Trace PR (#834). A reader encountering this comment today will search for a file that isn't there. Change to:

// pass a strided view (see issue #834) to sum a strided sequence such as a matrix diagonal.

or remove the parenthetical until #834 lands.

3. Missing `<complex>` include in `pairwise_sum.hpp` (Minor)

T(0) for complex types requires std::complex to be visible. Currently this is satisfied transitively via the callers, but pairwise_sum.hpp should be self-contained. Add #include <complex> to the header.

4. Why `half` must be a multiple of 8 — comment is missing (Minor)

std::size_t half = n / 2;
half -= half % 8;

This alignment ensures child calls can enter the unrolled path, but there's no comment explaining it. A brief note like // align to block size so the child call enters the unrolled path would help future readers.

Strengths

Clean, self-contained header with a clear algorithm structure.
Handles empty, small, medium, and large inputs correctly.
The PairwiseSum wrapper accepts any random-access range, enabling future reuse for strided reductions without changing the core algorithm.
Re-enabling the Accuracy test (rather than just removing DISABLED_) is the right approach — it now serves as a permanent regression guard.
Integer types correctly keep the straight loop (no rounding needed).
GPU path is correctly left unchanged.

Recommendation

Approve after addressing:

Confirm (or add) CMAKE_CXX_STANDARD 20 in CMakeLists.txt.
Fix the dangling stride_view.hpp reference in the comment.
(Optional) Add #include <complex> to pairwise_sum.hpp and a comment on the half % 8 alignment.

IvanaGyro

(Review written by Claude on behalf of @IvanaGyro.)

The switch to NumPy-style pairwise summation is a clean, well-documented improvement, and re-enabling the previously DISABLED_Accuracy test is the right move. The implementation looks correct: the empty range returns T(0), the <8 / <=128 / recursive-split branches are well-formed, and the powers-of-two block structure is what lets the homogeneous-value test now pass exactly. Two test-coverage gaps below; no blocking concerns with the production code itself.

IvanaGyro · 2026-05-31T07:03:10Z

-  TYPED_TEST(LinalgSumHomogeneousValuesTest, DISABLED_Accuracy) {
+  TYPED_TEST(LinalgSumHomogeneousValuesTest, Accuracy) {
    TypeParam value = LinalgSumHomogeneousValuesTest<TypeParam>::value;
    int element_number = 10000;


(Review written by Claude on behalf of @IvanaGyro.)

This single element_number = 10000 case only exercises the recursive-split branch (plus the 8-accumulator base). The two boundary branches in PairwiseSumBlocks go untested: the serial n < 8 path and the 8 <= n <= 128 unrolled path, including the off-by-one tails (n not a multiple of 8). A regression that broke either branch would still pass CI.

Suggest adding sizes that straddle the thresholds, e.g. 7, 8, 9, 128, 129, in addition to 10000 — a small value-parameterized companion would cover all three code paths cheaply.

IvanaGyro · 2026-05-31T07:03:10Z

   * This test also assesses the accuracy of summing floating-point numbers.
   */
-  TYPED_TEST(LinalgSumHomogeneousValuesTest, DISABLED_Accuracy) {
+  TYPED_TEST(LinalgSumHomogeneousValuesTest, Accuracy) {


(Review written by Claude on behalf of @IvanaGyro.)

All the FP accuracy assertions here use homogeneous (identical) values, the one input distribution where even naive serial summation stays nearly exact. That doesn't demonstrate or guard the actual benefit of pairwise summation — improved accuracy under a large dynamic range of magnitudes.

Consider one heterogeneous-magnitude test for cytnx_double/cytnx_float where naive accumulation visibly loses precision but pairwise recovers it, e.g. a tensor like [1e16, 1, 1, ..., 1, -1e16] (the 1s vanish in a naive running sum once it reaches 1e16, but the balanced tree keeps them). That both documents intent and locks in the O(log N) error behavior.

IvanaGyro mentioned this pull request May 29, 2026

refactor(trace): implement the GPU Tensor trace as CUDA kernels #850

Draft

3 tasks

gemini-code-assist Bot reviewed May 29, 2026

View reviewed changes

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated

IvanaGyro commented May 29, 2026

View reviewed changes

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated

IvanaGyro commented May 29, 2026

View reviewed changes

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated

IvanaGyro force-pushed the claude/835-pairwise-sum branch from e355381 to 4c76053 Compare May 29, 2026 07:53

IvanaGyro commented May 29, 2026

View reviewed changes

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated

IvanaGyro mentioned this pull request May 29, 2026

ci(coverage): build coverage on push to master so Codecov compares the right base #854

Merged

1 task

IvanaGyro force-pushed the claude/835-pairwise-sum branch from 4c76053 to 38fdf4f Compare May 29, 2026 23:12

IvanaGyro commented May 29, 2026

View reviewed changes

Comment thread tests/linalg_test/stride_view_test.cpp Outdated

IvanaGyro commented May 30, 2026

View reviewed changes

Comment thread src/backend/linalg_internal_cpu/stride_view.hpp Outdated

IvanaGyro force-pushed the claude/835-pairwise-sum branch from 38fdf4f to 3a5a680 Compare May 30, 2026 09:57

pcchen added this to the v1.1.0 milestone May 31, 2026

pcchen mentioned this pull request May 31, 2026

linalg::Sum accumulates floating-point error; use pairwise summation like NumPy #835

Open

This was referenced May 31, 2026

feat(stride_view): add a strided random-access range adaptor #860

Draft

refactor(trace): pass scalar Trace internal args by value and unify naming #861

Closed

refactor(trace): compute the CPU Tensor trace in the container layer #862

Draft

IvanaGyro commented May 31, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sum): use pairwise summation for floating-point linalg::Sum#849

fix(sum): use pairwise summation for floating-point linalg::Sum#849
IvanaGyro wants to merge 1 commit into
masterfrom
claude/835-pairwise-sum

IvanaGyro commented May 29, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

IvanaGyro left a comment

Uh oh!

Uh oh!

IvanaGyro commented May 29, 2026

Uh oh!

IvanaGyro left a comment

Uh oh!

Uh oh!

Uh oh!

pcchen commented May 31, 2026

Uh oh!

IvanaGyro left a comment

Uh oh!

IvanaGyro May 31, 2026

Uh oh!

IvanaGyro May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

IvanaGyro commented May 29, 2026

Problem

Change

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

IvanaGyro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

IvanaGyro commented May 29, 2026

Uh oh!

IvanaGyro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

pcchen commented May 31, 2026

PR #849 Review: fix(sum) — pairwise summation for floating-point linalg::Sum

Overview

Correctness

Issues

1. C++20 requirement needs explicit confirmation (Important)

2. Comment references non-existent stride_view.hpp (Minor)

3. Missing <complex> include in pairwise_sum.hpp (Minor)

4. Why half must be a multiple of 8 — comment is missing (Minor)

Strengths

Recommendation

Uh oh!

IvanaGyro left a comment

Choose a reason for hiding this comment

Uh oh!

IvanaGyro May 31, 2026

Choose a reason for hiding this comment

Uh oh!

IvanaGyro May 31, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented May 29, 2026 •

edited

Loading

PR #849 Review: `fix(sum)` — pairwise summation for floating-point `linalg::Sum`

2. Comment references non-existent `stride_view.hpp` (Minor)

3. Missing `<complex>` include in `pairwise_sum.hpp` (Minor)

4. Why `half` must be a multiple of 8 — comment is missing (Minor)