Skip to content

fix(sum): use pairwise summation for floating-point linalg::Sum#849

Draft
IvanaGyro wants to merge 1 commit into
masterfrom
claude/835-pairwise-sum
Draft

fix(sum): use pairwise summation for floating-point linalg::Sum#849
IvanaGyro wants to merge 1 commit into
masterfrom
claude/835-pairwise-sum

Conversation

@IvanaGyro
Copy link
Copy Markdown
Collaborator

Fixes #835.

Problem

linalg::Sum accumulated every element into one scalar in a serial loop, whose worst-case rounding error grows as O(N · eps). The Accuracy test in tests/linalg_test/sum_test.cpp (10000 identical floating-point values whose bit patterns are chosen to expose round-off) failed under naive accumulation and was therefore disabled.

Change

  • Add a strided pairwise-summation primitive src/backend/linalg_internal_cpu/pairwise_sum.hpp:
    • StrideView — a C++20 random-access, sized view over every k-th element of a range.
    • PairwiseSum — NumPy np.add.reduce-style summation: straight loop below 8 elements, an 8-accumulator unrolled block up to 128, and a halve-and-recurse split (rounded to a multiple of 8) above that. Worst-case error growth drops to O(log N · eps) at essentially the same cost.
  • Rewrite the Float/Double/ComplexFloat/ComplexDouble branches of Sum_internal to use PairwiseSum. Integer branches keep the straight loop (they do not round).
  • Re-enable the accuracy test (DISABLED_AccuracyAccuracy).
  • The GPU path (cuReduce_gpu) already performs a block/tree reduction with bounded error, so it is unchanged and its GpuAccuracy test stays enabled.

StrideView is intentionally reusable for strided reductions (e.g. a matrix diagonal); the follow-up Trace refactor (#834) builds on it.

Test plan

  • openblas-cpu: full suite 948/948 passed (includes the 10 re-enabled LinalgSumHomogeneousValuesTest.Accuracy<...> cases).
  • mkl-cpu: full suite 948/948 passed.
  • For the previously-disabled values the new code lands within 2 ULP of the reference (EXPECT_NUMBER_EQ tolerates 4 ULP); naive accumulation was 108 ULP off.
  • GPU CI (unchanged GPU path).

🤖 Draft — opened for review.


Generated by Claude Code

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces naive summation loops in Sum_internal.cpp with a recursive pairwise summation algorithm (PairwiseSum) implemented in pairwise_sum.hpp to improve floating-point accuracy, and enables the corresponding accuracy tests. The review feedback highlights a critical safety issue in StrideView::Iterator where storing a pointer to the view can lead to a dangling pointer, and recommends storing the underlying range's iterator instead, along with updating the begin() and end() methods.

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated
Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

❌ Patch coverage is 96.87500% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 28.70%. Comparing base (9f0e8fb) to head (3a5a680).
⚠️ Report is 192 commits behind head on master.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/backend/linalg_internal_cpu/pairwise_sum.hpp 96.42% 0 Missing and 1 partial ⚠️

❗ There is a different number of reports uploaded between BASE (9f0e8fb) and HEAD (3a5a680). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (9f0e8fb) HEAD (3a5a680)
1 0
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #849      +/-   ##
==========================================
- Coverage   36.89%   28.70%   -8.20%     
==========================================
  Files         213      242      +29     
  Lines       31032    35272    +4240     
  Branches    12738    14628    +1890     
==========================================
- Hits        11449    10124    -1325     
- Misses      17719    18161     +442     
- Partials     1864     6987    +5123     
Flag Coverage Δ
cpp 28.31% <96.87%> (?)
python 51.07% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
C++ backend 29.88% <76.19%> (∅)
Python bindings 16.89% <ø> (∅)
Python package 51.07% <ø> (∅)
Files with missing lines Coverage Δ
src/backend/linalg_internal_cpu/Sum_internal.cpp 96.29% <100.00%> (+86.61%) ⬆️
src/backend/linalg_internal_cpu/pairwise_sum.hpp 96.42% <96.42%> (ø)

... and 156 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d2a0d31...3a5a680. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated
Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated
@IvanaGyro IvanaGyro force-pushed the claude/835-pairwise-sum branch from e355381 to 4c76053 Compare May 29, 2026 07:53
Copy link
Copy Markdown
Collaborator Author

@IvanaGyro IvanaGyro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Review written by Claude on behalf of @IvanaGyro.)

The pairwise-summation core looks correct to me. PairwiseSumBlocks covers every element exactly once across the three regimes (<8 straight loop, the 8-accumulator block for [8,128] with its scalar tail, and the n>128 split where half is rounded down to a multiple of 8 and is always ≥64), and the reduction order matches NumPy's np.add.reduce. Empty ranges return T(0), preserving the old behavior. The Sum_internal rewrite and the test re-enable are clean.

One additive note on top of the existing comments — see the inline comment about StrideView being unexercised in this PR.

Comment thread src/backend/linalg_internal_cpu/pairwise_sum.hpp Outdated
@IvanaGyro
Copy link
Copy Markdown
Collaborator Author

(Comment written by Claude on behalf of @IvanaGyro.)

Test-coverage note (in addition to the StrideView point in the review): the re-enabled Accuracy test sums 10000 identical values. That exercises the recursive split, but not the block-size boundaries nor mixed-magnitude inputs — which is exactly where pairwise summation diverges from naive accumulation. Consider a small direct PairwiseSum unit test covering n = 0, 1, 7, 8, 9, 128, 129 with a mix of large/small/negative values, compared against a higher-precision (e.g. long double / Kahan) reference, so the 8-accumulator block edge and the halve-and-recurse boundary are pinned and can't silently regress.

Copy link
Copy Markdown
Collaborator Author

@IvanaGyro IvanaGyro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Review written by Claude on behalf of @IvanaGyro.)

Thanks for addressing the earlier round — stride_view/iterator naming, the dangling-pointer fix (iterator now stores the base iterator, with a dedicated IteratorsOutliveTheView test), the move to its own stride_view.hpp, the | stride(k) pipe adaptor, and the static_asserts on the range concepts all look good, and the fix is correct. One blocking issue: the new test file isn't registered in the build — details inline.

Comment thread tests/linalg_test/stride_view_test.cpp Outdated
Comment thread src/backend/linalg_internal_cpu/stride_view.hpp Outdated
linalg::Sum accumulated every element into one scalar in a serial loop,
whose worst-case rounding error grows as O(N * eps). The Accuracy test in
tests/linalg_test/sum_test.cpp (10000 identical floating-point values whose
bit patterns are chosen to expose round-off) failed under naive accumulation
and was disabled.

Add a pairwise-summation primitive (pairwise_sum.hpp): PairwiseSum mirrors
NumPy's np.add.reduce -- a straight loop below 8 elements, an eight-accumulator
unrolled block up to 128, and a halve-and-recurse split rounded to a multiple
of eight above that. Worst-case error growth drops to O(log N * eps) at
essentially the same cost. PairwiseSum consumes any random-access range, so it
also serves strided reductions later.

Rewrite the Float/Double/ComplexFloat/ComplexDouble branches of Sum_internal to
call PairwiseSum over the contiguous storage span and re-enable the accuracy
test. Integer branches keep the straight loop since they do not round. The GPU
path (cuReduce_gpu) already performs a block/tree reduction with bounded error,
so it is left unchanged.

For the disabled test's values the new code lands within 2 ULP of the
reference, against the 4-ULP tolerance of EXPECT_NUMBER_EQ; naive accumulation
was 108 ULP off.

Co-authored-by: Claude <noreply@anthropic.com>
@IvanaGyro IvanaGyro force-pushed the claude/835-pairwise-sum branch from 38fdf4f to 3a5a680 Compare May 30, 2026 09:57
@pcchen pcchen added this to the v1.1.0 milestone May 31, 2026
@pcchen
Copy link
Copy Markdown
Collaborator

pcchen commented May 31, 2026

PR #849 Review: fix(sum) — pairwise summation for floating-point linalg::Sum

Posted by Claude Code on behalf of @pcchen

Overview

This PR replaces the naive serial accumulation loop in Sum_internal (float/double/complexfloat/complexdouble) with a pairwise summation algorithm, reducing worst-case rounding error from O(N·eps) to O(log N·eps). The re-enabled Accuracy test confirms the fix works in practice (within 2 ULP vs. 108 ULP for naive).

The algorithm mirrors NumPy's np.add.reduce:

  • n < 8: straight loop
  • 8 ≤ n ≤ 128: 8-accumulator unrolled block + tail
  • n > 128: split at n/2 rounded down to a multiple of 8, recurse

Correctness

The algorithm is correct and the boundary cases are handled:

  • n = 0: falls into n < 8 path, loop doesn't execute, returns T(0). ✅
  • n = 8: initialises r0..r7, inner loop skipped (i+8 > n), tail skipped, returns tree sum of 8 accumulators. ✅
  • n = 9: inner loop skipped, tail adds first[8] to the tree sum. ✅
  • Complex types: T(0) constructs {0.0, 0.0} for std::complex<T>, all arithmetic works element-wise. ✅
  • Half-rounding can't produce zero: for n > 128, half = n/2 ≥ 64, and half - half%8 ≥ 56 > 0, so the recursion always terminates. ✅
  • Recursion depth: O(log₂(N/128)) — for N = 1 billion, depth ≈ 23, no stack concern. ✅

Issues

1. C++20 requirement needs explicit confirmation (Important)

pairwise_sum.hpp requires C++20:

  • #include <ranges> — C++20
  • std::ranges::random_access_range, std::ranges::range_value_t, std::ranges::begin, std::ranges::size — C++20 concepts/functions
  • std::random_access_iterator — C++20 concept
  • std::span in Sum_internal.cpp — C++20

The PR passes the full test suite, so the build system already compiles with C++20. But if CMakeLists.txt hasn't been updated to require C++20 (e.g., set(CMAKE_CXX_STANDARD 20)), a clean build on a C++17-defaulting toolchain will fail. Please confirm CMAKE_CXX_STANDARD is already set to ≥ 20, or update it in this PR.

2. Comment references non-existent stride_view.hpp (Minor)

// pass a strided view (see stride_view.hpp) to sum a strided sequence such as a matrix diagonal.

stride_view.hpp does not exist yet — it's deferred to the follow-up Trace PR (#834). A reader encountering this comment today will search for a file that isn't there. Change to:

// pass a strided view (see issue #834) to sum a strided sequence such as a matrix diagonal.

or remove the parenthetical until #834 lands.

3. Missing <complex> include in pairwise_sum.hpp (Minor)

T(0) for complex types requires std::complex to be visible. Currently this is satisfied transitively via the callers, but pairwise_sum.hpp should be self-contained. Add #include <complex> to the header.

4. Why half must be a multiple of 8 — comment is missing (Minor)

std::size_t half = n / 2;
half -= half % 8;

This alignment ensures child calls can enter the unrolled path, but there's no comment explaining it. A brief note like // align to block size so the child call enters the unrolled path would help future readers.


Strengths

  • Clean, self-contained header with a clear algorithm structure.
  • Handles empty, small, medium, and large inputs correctly.
  • The PairwiseSum wrapper accepts any random-access range, enabling future reuse for strided reductions without changing the core algorithm.
  • Re-enabling the Accuracy test (rather than just removing DISABLED_) is the right approach — it now serves as a permanent regression guard.
  • Integer types correctly keep the straight loop (no rounding needed).
  • GPU path is correctly left unchanged.

Recommendation

Approve after addressing:

  1. Confirm (or add) CMAKE_CXX_STANDARD 20 in CMakeLists.txt.
  2. Fix the dangling stride_view.hpp reference in the comment.
  3. (Optional) Add #include <complex> to pairwise_sum.hpp and a comment on the half % 8 alignment.

Copy link
Copy Markdown
Collaborator Author

@IvanaGyro IvanaGyro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Review written by Claude on behalf of @IvanaGyro.)

The switch to NumPy-style pairwise summation is a clean, well-documented improvement, and re-enabling the previously DISABLED_Accuracy test is the right move. The implementation looks correct: the empty range returns T(0), the <8 / <=128 / recursive-split branches are well-formed, and the powers-of-two block structure is what lets the homogeneous-value test now pass exactly. Two test-coverage gaps below; no blocking concerns with the production code itself.

TYPED_TEST(LinalgSumHomogeneousValuesTest, DISABLED_Accuracy) {
TYPED_TEST(LinalgSumHomogeneousValuesTest, Accuracy) {
TypeParam value = LinalgSumHomogeneousValuesTest<TypeParam>::value;
int element_number = 10000;
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Review written by Claude on behalf of @IvanaGyro.)

This single element_number = 10000 case only exercises the recursive-split branch (plus the 8-accumulator base). The two boundary branches in PairwiseSumBlocks go untested: the serial n < 8 path and the 8 <= n <= 128 unrolled path, including the off-by-one tails (n not a multiple of 8). A regression that broke either branch would still pass CI.

Suggest adding sizes that straddle the thresholds, e.g. 7, 8, 9, 128, 129, in addition to 10000 — a small value-parameterized companion would cover all three code paths cheaply.

* This test also assesses the accuracy of summing floating-point numbers.
*/
TYPED_TEST(LinalgSumHomogeneousValuesTest, DISABLED_Accuracy) {
TYPED_TEST(LinalgSumHomogeneousValuesTest, Accuracy) {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Review written by Claude on behalf of @IvanaGyro.)

All the FP accuracy assertions here use homogeneous (identical) values, the one input distribution where even naive serial summation stays nearly exact. That doesn't demonstrate or guard the actual benefit of pairwise summation — improved accuracy under a large dynamic range of magnitudes.

Consider one heterogeneous-magnitude test for cytnx_double/cytnx_float where naive accumulation visibly loses precision but pairwise recovers it, e.g. a tensor like [1e16, 1, 1, ..., 1, -1e16] (the 1s vanish in a naive running sum once it reaches 1e16, but the balanced tree keeps them). That both documents intent and locks in the O(log N) error behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

linalg::Sum accumulates floating-point error; use pairwise summation like NumPy

2 participants