Improve sqrt with SoftFloat-style lookup table and integer arithmetic #1331
justinzhuguangwen wants to merge 35 commits into boostorg:develop
Convert decimal to double, use std::sqrt, convert back. Added baseline, benchmarks, and build scripts. Updated .gitignore.
- Add decimal lookup table for 1/sqrt(x) approximation
- Implement SoftFloat-style remainder elimination
- Add safety checks to prevent inf/NaN results
- Maintain C++14 compatibility with out-of-line definitions
Performance: ~4-5x speedup over baseline implementation
Replace Padé approximation with 32-entry lookup table approach, reducing Newton-Raphson iterations from 3-5 to 1-3 based on precision.
- Add sqrt lookup tables for [0.1, 1.0) range with interpolation
- Reduce iterations: decimal32 (1), decimal64 (2), decimal128 (3)
- Maintain existing normalization and exponent handling
Performance: Fewer divisions per sqrt call with faster convergence.
Known issues: Some precision gaps remain, particularly for decimal32. Further optimization needed for interpolation and edge cases.
Replace Padé approximation with 1/sqrt(x) lookup table and linear interpolation, reducing Newton iterations from 3-5 to 1-2.
Replace Padé + Newton-Raphson with decimal-native lookup table algorithm. Uses 128-entry table with linear interpolation for initial approximation, then remainder elimination refinement using multiplication only (no division). Reduces iterations from 3-5 to 1-3. Experimental: precision tuning may be needed.
…nd sqrt optimization
…nd remove sqrt table
… remainder mapping
…red tables, C++14 tag dispatch
- Replace 128-entry table with 90-entry table (step 0.1)
- Use integer remainder for decimal32 (uint64) and decimal64 (uint128)
- Add 1/2/4 Newton iterations for decimal32/64/128
- Add generate_sqrt_tables.py for table generation
- Update decimal_float_and_sqrt.md with algorithm details
- Rewrite approx_recip_sqrt32/64 to match SoftFloat algorithm
- Use working scale and Newton iterations to avoid precision loss
- Switch sqrt32_impl and sqrt64_impl to integer approx_recip_sqrt
- Fix sqrt32 and sqrt64 test failures
- approx_recip_sqrt32/64: rewrite with integer-only Newton iterations
- sqrt32_impl: use approx_recip_sqrt32 for integer-based sqrt
- sqrt64_impl: use approx_recip_sqrt64 with __int128 when available
- sqrt64_impl: fallback path for platforms without __int128 (MSVC, 32-bit)
- sqrt128_impl: u256-based integer sqrt with frexp10 and round-to-nearest
Performance (ops/s):
- decimal32_t: 1.63M -> 9.55M (5.86x)
- decimal64_t: 0.74M -> 4.39M (5.97x)
- decimal128_t: 0.31M -> 0.75M (2.41x)
- decimal_fast32_t: 1.81M -> 8.63M (4.76x)
- decimal_fast64_t: 0.80M -> 3.74M (4.69x)
- decimal_fast128_t: 0.22M -> 0.69M (3.13x)
All sqrt tests pass for decimal32/64/128 and sqrt64 fallback path.
- sqrt_tables.hpp -> sqrt_lookup.hpp
- approx_recip_sqrt.hpp -> approx_recip_sqrt_impl.hpp
- remove approx_recip_sqrt_1
- approxRecipSqrt_1k0s/k1s -> recip_sqrt_k0s/k1s
- Add lookup table algorithm for sqrt optimization
- decimal32/64/128: 4.72x, 5.66x, 2.85x speedup respectively
- Remove temporary test files
An automated preview of the documentation is available at https://1331.decimal.prtest3.cppalliance.org/libs/decimal/doc/html/index.html
If more commits are pushed to the pull request, the docs will rebuild at the same URL.
2026-02-05 16:45:50 UTC
Hi @justinzhuguangwen I just approved your workflow run. In this repo, first-time contributors require workflow run approval. Looks like you're getting good perf results now. CI will be interesting. You're getting hit by some compiler warnings. Decimal runs high warning levels with So don't be initially shocked if you get a lot of failing runs on the first try. Usually, these matters can be cleared up in a few trials and everything gets OK. Our CI is set up to run faster runs on GHA and a whole bunch of slow tests on Drone. Some of the Drone tests already failed on the pesky In fact, I just got hit by Cc: @mborland
Hi @ckormanyos, thanks for approving the workflow and for the heads-up about the strict warning levels. I've pushed a fix for the I've started the workflow on my fork to verify the fix. Once it passes, I'll trigger a re-run here if needed. Thanks again for the guidance.
ckormanyos left a comment
You might prefer:
std::uint64_t base_sig = static_cast<std::uint64_t>((static_cast<unsigned int>(index) + 10U) * UINT64_C(100000000000000));
I hope this is syntactically correct and lets the compiler do unsigned * u64 -> u64 multiply
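The promotion the suggestion relies on can be checked at compile time. The following is a minimal sketch, not code from the PR: the helper name `base_sig_for` is hypothetical, and the index range 0..89 is an assumption taken from the 90-entry table mentioned in the commit messages.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper (not from the PR): scale (index + 10) by 10^14.
// Because one operand is a UINT64_C constant, the usual arithmetic
// conversions widen the unsigned int operand before multiplying.
constexpr std::uint64_t base_sig_for(int index) noexcept
{
    return static_cast<std::uint64_t>(
        (static_cast<unsigned int>(index) + 10U) * UINT64_C(100000000000000));
}

// The product type is unsigned and at least 64 bits wide.
static_assert(std::is_unsigned<decltype(1U * UINT64_C(1))>::value,
              "unsigned * u64 stays unsigned");
static_assert(sizeof(decltype(1U * UINT64_C(1))) >= 8,
              "unsigned * u64 is carried out in at least 64 bits");

// Assumed index range 0..89, so (index + 10) runs over 10..99.
static_assert(base_sig_for(0) == UINT64_C(1000000000000000), "10 * 10^14");
static_assert(base_sig_for(89) == UINT64_C(9900000000000000), "99 * 10^14");
```

If both operands were plain `unsigned int`, the multiplication would be done in `unsigned int` and wrap on platforms where that type is 32 bits wide, which is why the 64-bit constant matters here.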
Nice. Oh also, when we commit to a branch having a running PR with a running CI/CD, Matt (@mborland) has set it up to halt the running CI/CD and restart it. This is done both on GHA as well as on Drone. So it is not uncommon for us to commit, re-commit, and so on until we get it right. You don't need to be, let's say, ultra-conservative about that.
@justinzhuguangwen thank you for your amazing algorithmic and coding work. It looks like this thing is going to go green soon. Hi Matt (@mborland) would you like to look over this work regarding matters of style and Boost/Decimal consistency? Upon going green, I'd like to get this into 1.91.
Hi @ckormanyos, thanks for the kind words and for the review. The CI checks triggered by the latest commit have all passed. Happy to address any follow-up if needed. Note: This PR has many commits from incremental updates. When merging, I'd suggest using squash merge and using the PR description as the squash commit message to keep the history clean. Thanks again for the guidance.
mborland left a comment
This looks really good! My comments are really around other potential points of optimization, and I think they could apply in a few different places in the code. Let me know if you have questions.
CI has been started for these next improvements.
Hi Justin @justinzhuguangwen, I think a trivial warning has crept in. One of the CI logs shows: It's annoying, ... we know, but probably worth a lot of quality in the end.
Fixed. Running CI in my fork to confirm.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff            @@
##           develop   #1331     +/-   ##
=========================================
+ Coverage     98.8%   98.8%   +0.1%
=========================================
  Files          278     282      +4
  Lines        18034   18205    +171
  Branches      1918    1917      -1
=========================================
+ Hits         17812   17984    +172
+ Misses         222     221      -1
... and 3 files with indirect coverage changes
- Add decimal_fast32_t, decimal128_t test_sqrt_edge, decimal_fast128_t
- Add perfect squares (sqrt(4), sqrt(9)) to cover rem==0 branch
- Add non-perfect squares (sqrt(2), sqrt(5)) to cover Newton correction
- Add dense sampling [1.01, 9.99] to exercise rem<0 and other edge paths
Hi @justinzhuguangwen sorry, it took me a while to approve the CI run. I see you're going after code coverage. Each of the 32, 64, 128-bit But I was wondering if those lines can ever be reached? The calling instance (the dispatcher) in the
Hi @ckormanyos, Thanks for the review, and no worries about the CI delay. Those normalization loops are indeed unreachable. The dispatcher in I've removed this redundant defensive code in commit 30db62d. Local LCOV shows 100% line coverage for the sqrt-related impl files (
Looks like this is getting close, but is failing on ARM64 and S390X: One thing to be aware of is that on both of these platforms long doubles are 128 bits wide, so we likely have to adjust tolerances a bit.
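Why the width of `long double` matters for a tolerance check can be sketched as follows. This assumes, as an illustration only, that a test accepts a result when its relative error is at most a small factor times the control type's machine epsilon; the function name `within_tolerance` and the factor 8 are hypothetical and not taken from the test suite.

```cpp
#include <cmath>

// Hypothetical check: relative error must not exceed factor * eps.
inline bool within_tolerance(long double got, long double ref,
                             long double eps, int factor)
{
    return std::fabs(got - ref) <= eps * factor * std::fabs(ref);
}

// Machine epsilons of common floating-point formats, as powers of two.
inline long double eps_binary64()  { return std::ldexp(1.0L, -52);  } // double
inline long double eps_x87_80bit() { return std::ldexp(1.0L, -63);  } // x86 extended
inline long double eps_binary128() { return std::ldexp(1.0L, -112); } // IEEE quad

// A fixed relative error of 2^-50 that every long double can represent.
inline long double sample_result() { return 1.0L + std::ldexp(1.0L, -50); }
```

With these definitions, `within_tolerance(sample_result(), 1.0L, eps_binary64(), 8)` holds, while the same call with `eps_x87_80bit()` or `eps_binary128()` does not: the identical error passes or fails depending only on how fine the control type's epsilon is, which is the kind of platform dependence described above.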
Ummm to be honest, I don't think any of these new tests, nor these more new tests are intended to work for most platforms having 64-bit or 80-bit These tests might be mostly working by chance. I think for the new dense tests, as well as for the new random value-check tests at 128-bit, we might have to jump up to a multiprecision control-type in order to get platform-independent, reliable testing-versus-128-bit-decimal. This is pretty straightforward, when we finally get, for instance,
In particular, I mean since the tolerance is based on a tolerance-factor multiplied by the
…conversion issues
Thanks @mborland for the detailed failure report. The root cause aligns with @ckormanyos's analysis: on ARM64 and S390X, Switching to a multiprecision control type (e.g. One question: Why doesn't the fork's GitHub Actions CI report these ARM64 and S390X failures?
These architectures are in a different CI system called Drone. They're only available for the main repo, unlike GitHub Actions.
I'd say if this checkin does what's expected, we can simply go with it. If we decide to do something more fancy with Multiprecision down the road, I can explicitly help with that.
The failure in
Thanks @mborland and @ckormanyos for the investigation and the Drone CI explanation. I’m not sure what I should do next on my side. If you need any changes or follow-up from me, just say so.
Many, many thanks Justin for this excellent contribution. This looks good to go to me. This effort is done in my opinion. I'm ready to merge it any time. CI error is a spurious tolerance issue unrelated (I'll address that independently). Since this is a relatively big change and because Matt was so involved in this PR also, I'll wait for his nod prior to merging. Matt? Ready on this?

Description
Summary
Replace Padé + Newton-Raphson sqrt with a SoftFloat-inspired lookup table and integer-only Newton refinement. All arithmetic stays in integers until final conversion, reducing rounding errors and improving throughput.
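The general idea of seeding an integer-only refinement from a table and finishing with a remainder check can be illustrated with a much smaller stand-in. The sketch below is not the PR's implementation: it uses a hypothetical 9-entry seed table and a division-based Newton step instead of the 90-entry k0/k1 reciprocal-sqrt table, and the names `sqrt7_result` and `sqrt_sig7` are invented for this example. It assumes a normalized 7-digit significand and an already even exponent.

```cpp
#include <cstdint>

// Result of the sketch: a 7-digit root plus a flag for exactness.
struct sqrt7_result
{
    std::uint64_t root;
    bool          exact;
};

// Hypothetical helper, not code from the PR.
// Precondition: 1000000 <= sig <= 9999999.
inline sqrt7_result sqrt_sig7(std::uint64_t sig)
{
    // Seed table indexed by the leading digit d = 1..9. Each entry is
    // ceil(sqrt(d + 1) * 1000) * 1000, an upper bound on sqrt(sig * 10^6)
    // for every sig whose leading digit is d.
    static constexpr std::uint64_t seed[9] = {
        1415000, 1733000, 2000000, 2237000, 2450000,
        2646000, 2829000, 3000000, 3163000
    };

    const std::uint64_t n = sig * UINT64_C(1000000);  // 13-digit radicand
    std::uint64_t x = seed[sig / UINT64_C(1000000) - 1U];

    // Integer Newton steps. Starting at or above the true root, the
    // iterates decrease monotonically and stop at floor(sqrt(n)).
    for (;;)
    {
        const std::uint64_t y = (x + n / x) / 2U;
        if (y >= x) { break; }
        x = y;
    }

    // Remainder check: rem == 0 means a perfect square; rem > x means
    // n > (x + 1/2)^2, so the nearest integer root is x + 1.
    const std::uint64_t rem = n - x * x;
    const bool exact = (rem == 0U);
    if (rem > x) { ++x; }

    return {x, exact};
}
```

For example, `sqrt_sig7(4000000)` yields root 2000000 with `exact == true`, while `sqrt_sig7(2000000)` yields 1414214 with `exact == false`. Keeping everything in integers means the only rounding decision is the final remainder comparison.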
Changes
New files:
- doc/decimal_float_and_sqrt.md: Documentation on decimal vs binary float representation, bit layout, and sqrt algorithm
- include/boost/decimal/detail/cmath/impl/sqrt_lookup.hpp: 90-entry k0/k1 lookup table
- include/boost/decimal/detail/cmath/impl/approx_recip_sqrt_impl.hpp: approx_recip_sqrt32/approx_recip_sqrt64, integer-only 1/√x approximation via table lookup + Newton refinement
- include/boost/decimal/detail/cmath/impl/sqrt32_impl.hpp: Integer sqrt for decimal32 using approx_recip_sqrt32 and Newton correction
- include/boost/decimal/detail/cmath/impl/sqrt64_impl.hpp: Integer sqrt for decimal64 using approx_recip_sqrt64 and Newton correction
- include/boost/decimal/detail/cmath/impl/sqrt128_impl.hpp: u256-based integer sqrt for decimal128, frexp10 for exact significand, round-to-nearest
Modified files:
- sqrt.hpp: Switched to new impl headers, replacing Padé + Newton-Raphson implementation
- test/test_cmath.cpp: Updated sqrt-related tests
- test/test_sqrt.cpp: Updated sqrt-related tests
Performance (sqrt_bench.py, baseline vs current)
Note: sqrt_bench.py and ./run_srqt_test.sh are not part of this PR. To run them, use commit d54af195e45f1207c7d55fcdb26f5890d9aafbbd.
Tests
./run_srqt_test.sh passes at commit d54af195e45f1207c7d55fcdb26f5890d9aafbbd (decimal32, decimal64, decimal128, github_issue_1107, github_issue_1110).
Documentation
doc/decimal_float_and_sqrt.md: New document describing current algorithm and implementation details
Refs #1311