Skip to content

Improve sqrt with SoftFloat-style lookup table and integer arithmetic#1331

Open
justinzhuguangwen wants to merge 35 commits intoboostorg:developfrom
justinzhuguangwen:develop
Open

Improve sqrt with SoftFloat-style lookup table and integer arithmetic#1331
justinzhuguangwen wants to merge 35 commits intoboostorg:developfrom
justinzhuguangwen:develop

Conversation

@justinzhuguangwen
Copy link

@justinzhuguangwen justinzhuguangwen commented Feb 4, 2026

Description

Summary

Replace Padé + Newton-Raphson sqrt with a SoftFloat-inspired lookup table and integer-only Newton refinement. All arithmetic stays in integers until final conversion, reducing rounding errors and improving throughput.

Changes

New files:

  • doc/decimal_float_and_sqrt.md: Documentation on decimal vs binary float representation, bit layout, and sqrt algorithm
  • include/boost/decimal/detail/cmath/impl/sqrt_lookup.hpp: 90-entry k0/k1 lookup table
  • include/boost/decimal/detail/cmath/impl/approx_recip_sqrt_impl.hpp: approx_recip_sqrt32 / approx_recip_sqrt64, integer-only 1/√x approximation via table lookup + Newton refinement
  • include/boost/decimal/detail/cmath/impl/sqrt32_impl.hpp: Integer sqrt for decimal32 using approx_recip_sqrt32 and Newton correction
  • include/boost/decimal/detail/cmath/impl/sqrt64_impl.hpp: Integer sqrt for decimal64 using approx_recip_sqrt64 and Newton correction
  • include/boost/decimal/detail/cmath/impl/sqrt128_impl.hpp: u256-based integer sqrt for decimal128, frexp10 for exact significand, round-to-nearest

Modified files:

  • sqrt.hpp: Switched to new impl headers, replacing Padé + Newton-Raphson implementation
  • test/test_cmath.cpp: Updated sqrt-related tests
  • test/test_sqrt.cpp: Updated sqrt-related tests

Performance (sqrt_bench.py, baseline vs current)

Type Baseline Current Speedup
decimal32_t 1.59M ops/s 8.90M ops/s ✓ 5.59x (+458.9%)
decimal64_t 0.78M ops/s 4.17M ops/s ✓ 5.33x (+432.9%)
decimal128_t 0.24M ops/s 0.77M ops/s ✓ 3.15x (+214.8%)
decimal_fast32_t 1.80M ops/s 8.51M ops/s ✓ 4.72x (+372.3%)
decimal_fast64_t 0.81M ops/s 3.74M ops/s ✓ 4.64x (+363.9%)
decimal_fast128_t 0.22M ops/s 0.72M ops/s ✓ 3.21x (+221.2%)

Note: sqrt_bench.py and ./run_srqt_test.sh are not part of this PR. To run them, use commit d54af195e45f1207c7d55fcdb26f5890d9aafbbd.

Tests

  • CI covers sqrt-related tests. For local runs, ./run_srqt_test.sh passes at commit d54af195e45f1207c7d55fcdb26f5890d9aafbbd (decimal32, decimal64, decimal128, github_issue_1107, github_issue_1110).

Documentation

  • doc/decimal_float_and_sqrt.md: New document describing current algorithm and implementation details

Refs #1311

justinzhu@home added 23 commits January 25, 2026 03:05
Convert decimal to double, use std::sqrt, convert back. Added baseline,
benchmarks, and build scripts. Updated .gitignore.
- Add decimal lookup table for 1/sqrt(x) approximation
- Implement SoftFloat-style remainder elimination
- Add safety checks to prevent inf/NaN results
- Maintain C++14 compatibility with out-of-line definitions

Performance: ~4-5x speedup over baseline implementation
Replace Padé approximation with 32-entry lookup table approach,
reducing Newton-Raphson iterations from 3-5 to 1-3 based on precision.

- Add sqrt lookup tables for [0.1, 1.0) range with interpolation
- Reduce iterations: decimal32 (1), decimal64 (2), decimal128 (3)
- Maintain existing normalization and exponent handling

Performance: Fewer divisions per sqrt call with faster convergence.

Known issues: Some precision gaps remain, particularly for decimal32.
Further optimization needed for interpolation and edge cases.
Replace Padé approximation with 1/sqrt(x) lookup table and linear
interpolation, reducing Newton iterations from 3-5 to 1-2.
Replace Padé + Newton-Raphson with decimal-native lookup table
algorithm. Uses 128-entry table with linear interpolation for
initial approximation, then remainder elimination refinement using
multiplication only (no division). Reduces iterations from 3-5 to 1-3.

Experimental: precision tuning may be needed.
- Replace 128-entry table with 90-entry table (step 0.1)
- Use integer remainder for decimal32 (uint64) and decimal64 (uint128)
- Add 1/2/4 Newton iterations for decimal32/64/128
- Add generate_sqrt_tables.py for table generation
- Update decimal_float_and_sqrt.md with algorithm details
- Rewrite approx_recip_sqrt32/64 to match SoftFloat algorithm
- Use working scale and Newton iterations to avoid precision loss
- Switch sqrt32_impl and sqrt64_impl to integer approx_recip_sqrt
- Fix sqrt32 and sqrt64 test failures
- approx_recip_sqrt32/64: rewrite with integer-only Newton iterations
- sqrt32_impl: use approx_recip_sqrt32 for integer-based sqrt
- sqrt64_impl: use approx_recip_sqrt64 with __int128 when available
- sqrt64_impl: fallback path for platforms without __int128 (MSVC, 32-bit)
- sqrt128_impl: u256-based integer sqrt with frexp10 and round-to-nearest

Performance (ops/s):
- decimal32_t:      1.63M -> 9.55M  (5.86x)
- decimal64_t:      0.74M -> 4.39M  (5.97x)
- decimal128_t:     0.31M -> 0.75M  (2.41x)
- decimal_fast32_t: 1.81M -> 8.63M  (4.76x)
- decimal_fast64_t: 0.80M -> 3.74M  (4.69x)
- decimal_fast128_t: 0.22M -> 0.69M (3.13x)

All sqrt tests pass for decimal32/64/128 and sqrt64 fallback path.
- sqrt_tables.hpp -> sqrt_lookup.hpp
- approx_recip_sqrt.hpp -> approx_recip_sqrt_impl.hpp
- remove approx_recip_sqrt_1
- approxRecipSqrt_1k0s/k1s -> recip_sqrt_k0s/k1s
- Add lookup table algorithm for sqrt optimization
- decimal32/64/128: 4.72x, 5.66x, 2.85x speedup respectively
- Remove temporary test files
- Add lookup table algorithm for sqrt optimization
- decimal32/64/128: 4.72x, 5.66x, 2.85x speedup respectively
- Remove temporary test files
@cppalliance-bot
Copy link

cppalliance-bot commented Feb 4, 2026

An automated preview of the documentation is available at https://1331.decimal.prtest3.cppalliance.org/libs/decimal/doc/html/index.html

If more commits are pushed to the pull request, the docs will rebuild at the same URL.

2026-02-05 16:45:50 UTC

@ckormanyos
Copy link
Member

ckormanyos commented Feb 4, 2026

Hi @justinzhuguangwen I just approved your workflow run. In this repo, first-time contributors require workflow run approval.

Looks like you're getting good perf results now. CI will be interesting.

You're getting hit by some compiler warnings. Decimal runs high warning levels with -Werror. I should have warned you a bit more on this up front.

So don't be initially shocked if you get a lot of failing runs in the first try. Usually, these matters can be cleared up in a few trials and everything gets OK.

Our CI is setup to run faster runs on GHA and a whole bunch of slow tests on Drone. Some of the drone tests already failed on the pesky -Werror. But again, these matters can usually be cleared up with a few edits.

In fact, I just got hit by -Werror on my recent frexp PR too.

Cc: @mborland

@justinzhuguangwen
Copy link
Author

Hi @ckormanyos, thanks for approving the workflow and for the heads-up about the strict warning levels.

I've pushed a fix for the -Wsign-conversion error in approx_recip_sqrt64 (in approx_recip_sqrt_impl.hpp). The change is in the latest commit on this PR.

I've started the workflow on my fork to verify the fix. Once it passes, I'll trigger a re-run here if needed.

Thanks again for the guidance.

Copy link
Member

@ckormanyos ckormanyos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might prefer:

std::uint64_t base_sig = static_cast<std::uint64_t>((static_cast<unsigned int>(index) + 10U) * UINT64_C(100000000000000));

I hope this is syntactically correct and lets the compiler do unsigned * u64 -> u64 multiply

@ckormanyos
Copy link
Member

ckormanyos commented Feb 4, 2026

I've started the workflow on my fork to verify the fix. Once it passes, I'll trigger a re-run here if needed.

Nice.

Oh also, when we commit to a branch having a running PR with a running CI/CD, Matt (@mborland) has set it up to halt the running CI/CD and restart it. This is done both on GHA as well as on Drone.

So it is not uncommon for us to commit, re-commmit, and so on until it gets right. You don't need to be, let's say, ultra-conservative about that.

@ckormanyos
Copy link
Member

ckormanyos commented Feb 4, 2026

@justinzhuguangwen thank you for your amazing algorithmic and coding work. It looks like this thing is going to go green soon.

Hi Matt (@mborland) would you like to look over this work regarding matters of style and Boost/Decimal consistency?

Upon going green, I'd like to get this into 1.91.

@justinzhuguangwen
Copy link
Author

Hi @ckormanyos, thanks for the kind words and for the review.

The CI checks triggered by the latest commit have all passed. Happy to address any follow-up if needed.

Note: This PR has many commits from incremental updates. When merging, I'd suggest using squash merge and using the PR description as the squash commit message to keep the history clean.

Thanks again for the guidance.

Copy link
Member

@mborland mborland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! My comments are really around other potential points of optimization, and I think they could apply a few different places in the code. Let me know if you have questions.

@ckormanyos
Copy link
Member

CI has been started for these next improvements.

@ckormanyos
Copy link
Member

Hi Justin @justinzhuguangwen, I think a trivial warning has crept in.

One of the CI logs shows:

2026-02-04T17:16:24.2135712Z ./boost/decimal/detail/cmath/impl/sqrt128_impl.hpp:72:29: error: unused variable 'scale15' [-Werror,-Wunused-variable]
2026-02-04T17:16:24.2136834Z     constexpr std::uint64_t scale15 = 1000000000000000ULL;    // 10^15
2026-02-04T17:16:24.2137441Z                             ^
2026-02-04T17:16:24.2137816Z 1 error generated.

It's annayong, ... we know, but probably worth a lot of quality in the end.

@justinzhuguangwen
Copy link
Author

Hi Justin @justinzhuguangwen, I think a trivial warning has crept in.

One of the CI logs shows:

2026-02-04T17:16:24.2135712Z ./boost/decimal/detail/cmath/impl/sqrt128_impl.hpp:72:29: error: unused variable 'scale15' [-Werror,-Wunused-variable]
2026-02-04T17:16:24.2136834Z     constexpr std::uint64_t scale15 = 1000000000000000ULL;    // 10^15
2026-02-04T17:16:24.2137441Z                             ^
2026-02-04T17:16:24.2137816Z 1 error generated.

It's annayong, ... we know, but probably worth a lot of quality in the end.

Fixed. Running CI in my fork to confirm.

@codecov
Copy link

codecov bot commented Feb 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.8%. Comparing base (e72fd34) to head (1f10e9c).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##           develop   #1331     +/-   ##
=========================================
+ Coverage     98.8%   98.8%   +0.1%     
=========================================
  Files          278     282      +4     
  Lines        18034   18205    +171     
  Branches      1918    1917      -1     
=========================================
+ Hits         17812   17984    +172     
+ Misses         222     221      -1     
Files with missing lines Coverage Δ
...cimal/detail/cmath/impl/approx_recip_sqrt_impl.hpp 100.0% <100.0%> (ø)
...e/boost/decimal/detail/cmath/impl/sqrt128_impl.hpp 100.0% <100.0%> (ø)
...de/boost/decimal/detail/cmath/impl/sqrt32_impl.hpp 100.0% <100.0%> (ø)
...de/boost/decimal/detail/cmath/impl/sqrt64_impl.hpp 100.0% <100.0%> (ø)
include/boost/decimal/detail/cmath/sqrt.hpp 100.0% <100.0%> (ø)
test/test_cmath.cpp 100.0% <ø> (ø)
test/test_sqrt.cpp 100.0% <100.0%> (ø)

... and 3 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e72fd34...1f10e9c. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Add decimal_fast32_t, decimal128_t test_sqrt_edge, decimal_fast128_t
- Add perfect squares (sqrt(4), sqrt(9)) to cover rem==0 branch
- Add non-perfect squares (sqrt(2), sqrt(5)) to cover Newton correction
- Add dense sampling [1.01, 9.99] to exercise rem<0 and other edge paths
@ckormanyos
Copy link
Member

ckormanyos commented Feb 5, 2026

Hi @justinzhuguangwen sorry, it took me a while to approve the CI run.

I see you're going after code coverage. Each of the 32, 64, 128-bit impl files has a rescaling of gx within the range $gx{\in}[1, 10)$.

But I was wondering if those lines can ever be reached?

The calling instance (the dispatcher) in the sqrt impl has already normalized the argument. I wonder if those secondary checks are actually reachable or unreachable lines? But I'm kind of unsure here. It's always good to have a double-check. But if they're unreachable, they might be not needed?

@justinzhuguangwen
Copy link
Author

Hi @ckormanyos,

Thanks for the review, and no worries about the CI delay.

Those normalization loops are indeed unreachable. The dispatcher in sqrt.hpp already normalizes via frexp10 and T{sig, -(digits10-1)}, so gx is always in [1, 10) when passed to the impl functions. The while (gx >= 10) and while (gx < 1) loops in the impls can never run.

I've removed this redundant defensive code in commit 30db62d.

Local LCOV shows 100% line coverage for the sqrt-related impl files (sqrt32_impl.hpp, sqrt64_impl.hpp, sqrt128_impl.hpp, approx_recip_sqrt_impl.hpp).

@mborland
Copy link
Member

mborland commented Feb 5, 2026

Looks like this is getting close, but is failing on ARM64 and S390X:

====== BEGIN OUTPUT ======
a: 1004987562112089.02696744600636782
b: 31780497.164141406803720633487654
delta: 31622775.601683793319131798367038
tol: 6.16297582203915472977912941627177e-33
libs/decimal/test/test_sqrt.cpp(296): test 'dense_ok' failed in function 'bool local::test_sqrt_edge() [with DecimalType = boost::decimal::decimal128_t; FloatType = long double]'
libs/decimal/test/test_sqrt.cpp(504): test 'result_edge_is_ok' failed in function 'int main()'
a: 0.266420005605452063686250215730292
b: 266420005605452.063671807558286741
delta: 0.999999999999999
tol: 1.23259516440783094595582588325435e-32
x_flt  : 7.097961938680910903567440323417026e-02
val_flt: 2.664200056054520636862502157302923e-01
val_dec: 2.664200056054520636862502157302922e-01
libs/decimal/test/test_sqrt.cpp(130): test 'result_is_ok' failed in function 'bool local::test_sqrt(int32_t, long double, long double) [with DecimalType = boost::decimal::decimal_fast128_t; FloatType = long double; int32_t = int]'
a: 3.09523025059134142969267962822244
b: 3095230250591341.42952488686012074
delta: 0.999999999999999
tol: 1.23259516440783094595582588325435e-32
x_flt  : 9.580450304175738262875745064136701e+00
val_flt: 3.095230250591341429692679628222443e+00
val_dec: 3.095230250591341429692679628222443e+00
libs/decimal/test/test_sqrt.cpp(130): test 'result_is_ok' failed in function 'bool local::test_sqrt(int32_t, long double, long double) [with DecimalType = boost::decimal::decimal_fast128_t; FloatType = long double; int32_t = int]'
a: 2.34545724933813204960411295291378e+37
b: 2.34545724933813204947696546065357e+52
delta: 0.999999999999999
tol: 1.23259516440783094595582588325435e-32
x_flt  : 5.501169708472796534840721894222325e+74
val_flt: 2.345457249338132049604112952913781e+37
val_dec: 2.345457249338132049604112952913781e+37
libs/decimal/test/test_sqrt.cpp(130): test 'result_is_ok' failed in function 'bool local::test_sqrt(int32_t, long double, long double) [with DecimalType = boost::decimal::decimal_fast128_t; FloatType = long double; int32_t = int]'
libs/decimal/test/test_sqrt.cpp(517): test 'result_small_is_ok' failed in function 'int main()'
libs/decimal/test/test_sqrt.cpp(518): test 'result_medium_is_ok' failed in function 'int main()'
libs/decimal/test/test_sqrt.cpp(519): test 'result_large_is_ok' failed in function 'int main()'
a: 1004987562112089.02696744600636782
b: 31780497.164141406803720633487654
delta: 31622775.601683793319131798367038
tol: 6.16297582203915472977912941627177e-33
libs/decimal/test/test_sqrt.cpp(296): test 'dense_ok' failed in function 'bool local::test_sqrt_edge() [with DecimalType = boost::decimal::decimal_fast128_t; FloatType = long double]'
libs/decimal/test/test_sqrt.cpp(526): test 'result_is_ok' failed in function 'int main()'
10 errors detected.

EXIT STATUS: 10
====== END OUTPUT ======

One thing to be aware of is that on both of these platforms long doubles are 128-bits, so we likely have to adjust tolerances a bit.

@ckormanyos
Copy link
Member

ckormanyos commented Feb 5, 2026

Ummm to be honest, I don't think any of these new tests, nor these more new tests are intended to work for most ptatforms having 64-bit or 80-bit long double.

These tests might be mostly working by chance. I think for the new dense tests, as well as for the new random value-ckeck tests at 128-bit, we might have to jump up to a multiprecision control-type in order to get platform-independent, reliable testing-versus-128-bit-decimal. This is pretty straightforward, when we finally get, for instance, cpp_dec_float in the tsts, which is about to happen in #1330.

@ckormanyos
Copy link
Member

ckormanyos commented Feb 5, 2026

In particular, I mean since the tolerance is based on a tolerance-factor multiplied by the epsilon of the built-in control type, then we are testing 128-bit decimal sqrt with the tolerance of, say, 16*epsilon() for 64-bit long double. These will pass, but they're not really tight tests.

@justinzhuguangwen
Copy link
Author

Thanks @mborland for the detailed failure report. The root cause aligns with @ckormanyos's analysis: on ARM64 and S390X, long double is 128-bit, so std::numeric_limits<long double>::epsilon() is much smaller than on x86 (64-bit or 80-bit long double), and the current tolerance factor (tol_factor * epsilon) becomes too strict for these platforms.

Switching to a multiprecision control type (e.g. cpp_dec_float from #1330) for 128-bit decimal sqrt testing would give platform-independent behavior and more reliable checks. Until then, we could consider platform-specific tolerance adjustments for these architectures, though that would only be a temporary workaround.


One question: Why doesn't the fork's GitHub Actions CI report these ARM64 and S390X failures?

@mborland
Copy link
Member

mborland commented Feb 5, 2026

These architectures are in a different CI system called drone. They're only available for the main repo unlike GitHub actions.

@ckormanyos
Copy link
Member

Why doesn't the fork's GitHub Actions CI report these ARM64 and S390X failures?

These runs as well as those for most of the older compilers run in a Drone process. The Drone CI looks like one single check in GHA, but it does not run on GHA. See the image below. You need an account at drone, or we have to give you one (I forgot which) in order to see the many, many individual runners running on Drone.

I believe when you run CI on your own develop branch, Drone simply will not run.

image

@ckormanyos
Copy link
Member

I'd say if this checkin does what's expected, we can simply go with it. If we decide to do something more fancy with Multiprecision down the road, I can explicitly help with that.

@ckormanyos
Copy link
Member

The failure in test_asinh is an unrelated tolerance issue. It looks like a random intolerant argument was found. I'll look into this independently.

@justinzhuguangwen
Copy link
Author

Thanks @mborland and @ckormanyos for the investigation and the Drone CI explanation. I’m not sure what I should do next on my side—if you need any changes or follow-up from me, just say so.

@ckormanyos
Copy link
Member

ckormanyos commented Feb 6, 2026

not sure what I should do next on my side

Many, many thanks Justin for this excellent contribution.

This looks good to go to me. This effort is done in my opinion. I'm ready to merge it any time. CI error is a spurious tolerance issue unrelated (I'll address that independently).

Since this is a relatively big change and because Matt was so involved in this PR also, I'll wait for his nod prior to merging.

Matt? Ready on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants