fix(sum): use pairwise summation for floating-point linalg::Sum#849
fix(sum): use pairwise summation for floating-point linalg::Sum#849IvanaGyro wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request replaces naive summation loops in Sum_internal.cpp with a recursive pairwise summation algorithm (PairwiseSum) implemented in pairwise_sum.hpp to improve floating-point accuracy, and enables the corresponding accuracy tests. The review feedback highlights a critical safety issue in StrideView::Iterator where storing a pointer to the view can lead to a dangling pointer, and recommends storing the underlying range's iterator instead, along with updating the begin() and end() methods.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #849 +/- ##
==========================================
- Coverage 36.89% 28.70% -8.20%
==========================================
Files 213 242 +29
Lines 31032 35272 +4240
Branches 12738 14628 +1890
==========================================
- Hits 11449 10124 -1325
- Misses 17719 18161 +442
- Partials 1864 6987 +5123
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 156 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
e355381 to
4c76053
Compare
IvanaGyro
left a comment
There was a problem hiding this comment.
(Review written by Claude on behalf of @IvanaGyro.)
The pairwise-summation core looks correct to me. PairwiseSumBlocks covers every element exactly once across the three regimes (<8 straight loop, the 8-accumulator block for [8,128] with its scalar tail, and the n>128 split where half is rounded down to a multiple of 8 and is always ≥64), and the reduction order matches NumPy's np.add.reduce. Empty ranges return T(0), preserving the old behavior. The Sum_internal rewrite and the test re-enable are clean.
One additive note on top of the existing comments — see the inline comment about StrideView being unexercised in this PR.
|
(Comment written by Claude on behalf of @IvanaGyro.) Test-coverage note (in addition to the |
4c76053 to
38fdf4f
Compare
IvanaGyro
left a comment
There was a problem hiding this comment.
(Review written by Claude on behalf of @IvanaGyro.)
Thanks for addressing the earlier round — stride_view/iterator naming, the dangling-pointer fix (iterator now stores the base iterator, with a dedicated IteratorsOutliveTheView test), the move to its own stride_view.hpp, the | stride(k) pipe adaptor, and the static_asserts on the range concepts all look good, and the fix is correct. One blocking issue: the new test file isn't registered in the build — details inline.
linalg::Sum accumulated every element into one scalar in a serial loop, whose worst-case rounding error grows as O(N * eps). The Accuracy test in tests/linalg_test/sum_test.cpp (10000 identical floating-point values whose bit patterns are chosen to expose round-off) failed under naive accumulation and was disabled. Add a pairwise-summation primitive (pairwise_sum.hpp): PairwiseSum mirrors NumPy's np.add.reduce -- a straight loop below 8 elements, an eight-accumulator unrolled block up to 128, and a halve-and-recurse split rounded to a multiple of eight above that. Worst-case error growth drops to O(log N * eps) at essentially the same cost. PairwiseSum consumes any random-access range, so it also serves strided reductions later. Rewrite the Float/Double/ComplexFloat/ComplexDouble branches of Sum_internal to call PairwiseSum over the contiguous storage span and re-enable the accuracy test. Integer branches keep the straight loop since they do not round. The GPU path (cuReduce_gpu) already performs a block/tree reduction with bounded error, so it is left unchanged. For the disabled test's values the new code lands within 2 ULP of the reference, against the 4-ULP tolerance of EXPECT_NUMBER_EQ; naive accumulation was 108 ULP off. Co-authored-by: Claude <noreply@anthropic.com>
38fdf4f to
3a5a680
Compare
PR #849 Review:
|
IvanaGyro
left a comment
There was a problem hiding this comment.
(Review written by Claude on behalf of @IvanaGyro.)
The switch to NumPy-style pairwise summation is a clean, well-documented improvement, and re-enabling the previously DISABLED_Accuracy test is the right move. The implementation looks correct: the empty range returns T(0), the <8 / <=128 / recursive-split branches are well-formed, and the powers-of-two block structure is what lets the homogeneous-value test now pass exactly. Two test-coverage gaps below; no blocking concerns with the production code itself.
| TYPED_TEST(LinalgSumHomogeneousValuesTest, DISABLED_Accuracy) { | ||
| TYPED_TEST(LinalgSumHomogeneousValuesTest, Accuracy) { | ||
| TypeParam value = LinalgSumHomogeneousValuesTest<TypeParam>::value; | ||
| int element_number = 10000; |
There was a problem hiding this comment.
(Review written by Claude on behalf of @IvanaGyro.)
This single element_number = 10000 case only exercises the recursive-split branch (plus the 8-accumulator base). The two boundary branches in PairwiseSumBlocks go untested: the serial n < 8 path and the 8 <= n <= 128 unrolled path, including the off-by-one tails (n not a multiple of 8). A regression that broke either branch would still pass CI.
Suggest adding sizes that straddle the thresholds, e.g. 7, 8, 9, 128, 129, in addition to 10000 — a small value-parameterized companion would cover all three code paths cheaply.
| * This test also assesses the accuracy of summing floating-point numbers. | ||
| */ | ||
| TYPED_TEST(LinalgSumHomogeneousValuesTest, DISABLED_Accuracy) { | ||
| TYPED_TEST(LinalgSumHomogeneousValuesTest, Accuracy) { |
There was a problem hiding this comment.
(Review written by Claude on behalf of @IvanaGyro.)
All the FP accuracy assertions here use homogeneous (identical) values, the one input distribution where even naive serial summation stays nearly exact. That doesn't demonstrate or guard the actual benefit of pairwise summation — improved accuracy under a large dynamic range of magnitudes.
Consider one heterogeneous-magnitude test for cytnx_double/cytnx_float where naive accumulation visibly loses precision but pairwise recovers it, e.g. a tensor like [1e16, 1, 1, ..., 1, -1e16] (the 1s vanish in a naive running sum once it reaches 1e16, but the balanced tree keeps them). That both documents intent and locks in the O(log N) error behavior.
Fixes #835.
Problem
linalg::Sumaccumulated every element into one scalar in a serial loop, whose worst-case rounding error grows asO(N · eps). TheAccuracytest intests/linalg_test/sum_test.cpp(10000 identical floating-point values whose bit patterns are chosen to expose round-off) failed under naive accumulation and was therefore disabled.Change
src/backend/linalg_internal_cpu/pairwise_sum.hpp:StrideView— a C++20 random-access, sized view over every k-th element of a range.PairwiseSum— NumPynp.add.reduce-style summation: straight loop below 8 elements, an 8-accumulator unrolled block up to 128, and a halve-and-recurse split (rounded to a multiple of 8) above that. Worst-case error growth drops toO(log N · eps)at essentially the same cost.Float/Double/ComplexFloat/ComplexDoublebranches ofSum_internalto usePairwiseSum. Integer branches keep the straight loop (they do not round).DISABLED_Accuracy→Accuracy).cuReduce_gpu) already performs a block/tree reduction with bounded error, so it is unchanged and itsGpuAccuracytest stays enabled.StrideViewis intentionally reusable for strided reductions (e.g. a matrix diagonal); the follow-up Trace refactor (#834) builds on it.Test plan
openblas-cpu: full suite 948/948 passed (includes the 10 re-enabledLinalgSumHomogeneousValuesTest.Accuracy<...>cases).mkl-cpu: full suite 948/948 passed.EXPECT_NUMBER_EQtolerates 4 ULP); naive accumulation was 108 ULP off.🤖 Draft — opened for review.
Generated by Claude Code