Skip to content

Extended precision for reduction operation in device_math#2541

Open
timofeymukha wants to merge 18 commits into
ExtremeFLOW:developfrom
timofeymukha:feature/xp_device_reductions
Open

Extended precision for reduction operation in device_math#2541
timofeymukha wants to merge 18 commits into
ExtremeFLOW:developfrom
timofeymukha:feature/xp_device_reductions

Conversation

@timofeymukha

@timofeymukha timofeymukha commented May 20, 2026

Copy link
Copy Markdown
Collaborator

Changes gl* kernelds with additive accumulation to use extended precision for the accumulating variables / buffers.

For CUDA and HIP, the kernels are templated on T_acc, which is set to real_xp in the launching function.

For OpenCL, I had to just typedef double real_xp; in math_kernel.cl and use real_xp directly.

@timofeymukha timofeymukha requested a review from njansson May 20, 2026 13:34
@njansson

Copy link
Copy Markdown
Collaborator

I would suggest that instead of using real_xp directly, we use an additional template variable in addition to T

@timofeymukha

Copy link
Copy Markdown
Collaborator Author

I would suggest that instead of using real_xp directly, we use an additional template variable in addition to T

Darn, that's what AI wanted to do, and I told it not to :-)). Alrighty!

@timofeymukha timofeymukha changed the title Extended precision for reduction operation in device_math (CUDA and HIP) Extended precision for reduction operation in device_math May 21, 2026
@timofeymukha timofeymukha requested a review from Copilot May 21, 2026 08:49
@timofeymukha timofeymukha added the enhancement New feature or request label May 21, 2026
@timofeymukha timofeymukha marked this pull request as ready for review May 21, 2026 08:52
@njansson

Copy link
Copy Markdown
Collaborator

This will unfortunately break on apple silicon, where only single is supported.

How about we ifdef that typedef and retain real_xp as real if we are on macOS ?

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the device-backend global reduction routines (glsum, glsc2, glsc3, glsubnorm, glsc3_many) to accumulate and reduce in extended precision (xp) on the device and across ranks/GPUs, then cast results back to the working precision (rp) at the API boundary. This improves numerical robustness (especially when rp is single precision) by reducing round-off during large reductions.

Changes:

  • Updated device_math to use xp temporaries and MPI_EXTRA_PRECISION for host-side MPI reductions, with final casts back to rp.
  • Updated CUDA/HIP/OpenCL device wrappers and kernels so the intermediate reduction buffers use real_xp and the kernels support an accumulator type (T_acc).
  • Added/extended backend support utilities for extra-precision reduction buffers and inter-GPU/rank reductions in CUDA/HIP.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/math/math.f90 Minor docstring correction for vlsc2 formula description.
src/math/bcknd/device/device_math.F90 Switches device reductions to accumulate in xp and MPI-reduce with MPI_EXTRA_PRECISION, then converts back to rp.
src/math/bcknd/device/cuda/cuda_math.f90 Updates CUDA Fortran interfaces to return/pass c_xp for gl* reductions.
src/math/bcknd/device/cuda/math.cu Implements extra-precision reduction buffers and global reduction helper; updates CUDA gl* wrappers to use real_xp.
src/math/bcknd/device/cuda/math_kernel.h Adds accumulator-type templates for CUDA reduction kernels and introduces vlsc3_kernel vs glsc3_kernel.
src/math/bcknd/device/hip/hip_math.f90 Updates HIP Fortran interfaces to return/pass c_xp for gl* reductions.
src/math/bcknd/device/hip/math.hip Implements extra-precision reduction buffers and global reduction helper; updates HIP gl* wrappers to use real_xp.
src/math/bcknd/device/hip/math_kernel.h Adds accumulator-type templates for HIP reduction kernels and introduces vlsc3_kernel vs glsc3_kernel.
src/math/bcknd/device/opencl/opencl_math.f90 Updates OpenCL Fortran interfaces to return/pass c_xp for gl* reductions.
src/math/bcknd/device/opencl/math.c Updates OpenCL gl* wrappers to use real_xp buffers and return real_xp.
src/math/bcknd/device/opencl/math_kernel.cl Updates OpenCL reduction kernels to write real_xp intermediates and accumulate with real_xp.
CHANGELOG.md Documents the move to extended precision reductions in device_math.

Comment thread src/math/bcknd/device/opencl/math_kernel.cl
Comment thread src/math/bcknd/device/cuda/math_kernel.h
Comment thread src/math/bcknd/device/hip/math_kernel.h
Comment thread src/math/bcknd/device/opencl/math.c
Comment thread src/math/bcknd/device/opencl/math.c
Comment thread src/math/bcknd/device/opencl/math.c
Comment thread src/math/bcknd/device/opencl/math.c
Comment thread CHANGELOG.md Outdated
@njansson

Copy link
Copy Markdown
Collaborator

This will unfortunately break on apple silicon, where only single is supported.

How about we ifdef that typedef and retain real_xp as real if we are on macOS ?

something like

#ifdef __APPLE__
 typedef real real_xp;
#else
typedef double real_xp;
#endif 

@njansson

Copy link
Copy Markdown
Collaborator

This will unfortunately break on apple silicon, where only single is supported.
How about we ifdef that typedef and retain real_xp as real if we are on macOS ?

something like

#ifdef __APPLE__
 typedef real real_xp;
#else
typedef double real_xp;
#endif 

also we do have a header where this can be set in the device folder

@timofeymukha timofeymukha requested a review from timfelle May 21, 2026 10:30
Comment thread src/math/bcknd/device/opencl/math_kernel.cl Outdated
@vbaconnet

Copy link
Copy Markdown
Collaborator

Thank you for doing this important work! I have two comments/questions:

  • Why only device_math and not host math? :D
  • Would there be instances where we would not want to truncate the reduction from xp to rp? In other words t keep the reduction result in xp? It depends on the use case I guess.

@timofeymukha

Copy link
Copy Markdown
Collaborator Author

Thank you for doing this important work! I have two comments/questions:

  • Why only device_math and not host math? :D
  • Would there be instances where we would not want to truncate the reduction from xp to rp? In other words, to keep the reduction result in xp? It depends on the use case I guess.

On the host, it is already implemented! Not 100% sure about the second point, but typically I would say that if you will use the values in rp computations down the line, there is not a lot of point in retaining one scalar in xp. But there may be exceptions, I guess.

@vbaconnet

Copy link
Copy Markdown
Collaborator

On the host, it is already implemented!

Indeed, my bad..

@njansson njansson enabled auto-merge June 15, 2026 14:11
@timfelle timfelle moved this from 📋 Todo to 🏗 In progress in Neko v1.1.0 release Jun 16, 2026
*/
template< typename T >
__global__ void glsc3_kernel(const T * a,
__global__ void vlsc3_kernel(const T * a,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vlsc?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

Status: 🏗 In progress

Development

Successfully merging this pull request may close these issues.

5 participants