Extended precision for reduction operation in device_math#2541
Extended precision for reduction operation in device_math#2541timofeymukha wants to merge 18 commits into
Conversation
|
I would suggest that instead of using |
Darn, that's what AI wanted to do, and I told it not to :-)). Alrighty! |
|
This will unfortunately break on apple silicon, where only How about we ifdef that typedef and retain |
There was a problem hiding this comment.
Pull request overview
This PR updates the device-backend global reduction routines (glsum, glsc2, glsc3, glsubnorm, glsc3_many) to accumulate and reduce in extended precision (xp) on the device and across ranks/GPUs, then cast results back to the working precision (rp) at the API boundary. This improves numerical robustness (especially when rp is single precision) by reducing round-off during large reductions.
Changes:
- Updated
device_mathto usexptemporaries andMPI_EXTRA_PRECISIONfor host-side MPI reductions, with final casts back torp. - Updated CUDA/HIP/OpenCL device wrappers and kernels so the intermediate reduction buffers use
real_xpand the kernels support an accumulator type (T_acc). - Added/extended backend support utilities for extra-precision reduction buffers and inter-GPU/rank reductions in CUDA/HIP.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/math/math.f90 | Minor docstring correction for vlsc2 formula description. |
| src/math/bcknd/device/device_math.F90 | Switches device reductions to accumulate in xp and MPI-reduce with MPI_EXTRA_PRECISION, then converts back to rp. |
| src/math/bcknd/device/cuda/cuda_math.f90 | Updates CUDA Fortran interfaces to return/pass c_xp for gl* reductions. |
| src/math/bcknd/device/cuda/math.cu | Implements extra-precision reduction buffers and global reduction helper; updates CUDA gl* wrappers to use real_xp. |
| src/math/bcknd/device/cuda/math_kernel.h | Adds accumulator-type templates for CUDA reduction kernels and introduces vlsc3_kernel vs glsc3_kernel. |
| src/math/bcknd/device/hip/hip_math.f90 | Updates HIP Fortran interfaces to return/pass c_xp for gl* reductions. |
| src/math/bcknd/device/hip/math.hip | Implements extra-precision reduction buffers and global reduction helper; updates HIP gl* wrappers to use real_xp. |
| src/math/bcknd/device/hip/math_kernel.h | Adds accumulator-type templates for HIP reduction kernels and introduces vlsc3_kernel vs glsc3_kernel. |
| src/math/bcknd/device/opencl/opencl_math.f90 | Updates OpenCL Fortran interfaces to return/pass c_xp for gl* reductions. |
| src/math/bcknd/device/opencl/math.c | Updates OpenCL gl* wrappers to use real_xp buffers and return real_xp. |
| src/math/bcknd/device/opencl/math_kernel.cl | Updates OpenCL reduction kernels to write real_xp intermediates and accumulate with real_xp. |
| CHANGELOG.md | Documents the move to extended precision reductions in device_math. |
something like #ifdef __APPLE__
typedef real real_xp;
#else
typedef double real_xp;
#endif |
also we do have a header where this can be set in the device folder |
|
Thank you for doing this important work! I have two comments/questions:
|
On the host, it is already implemented! Not 100% sure about the second point, but typically I would say that if you will use the values in rp computations down the line, there is not a lot of point in retaining one scalar in xp. But there may be exceptions, I guess. |
Indeed, my bad.. |
| */ | ||
| template< typename T > | ||
| __global__ void glsc3_kernel(const T * a, | ||
| __global__ void vlsc3_kernel(const T * a, |
Changes gl* kernelds with additive accumulation to use extended precision for the accumulating variables / buffers.
For CUDA and HIP, the kernels are templated on
T_acc, which is set toreal_xpin the launching function.For OpenCL, I had to just
typedef double real_xp;in math_kernel.cl and use real_xp directly.