Extended precision for reduction operation in device_math by timofeymukha · Pull Request #2541 · ExtremeFLOW/neko

timofeymukha · 2026-05-20T13:34:21Z

Changes gl* kernels with additive accumulation to use extended precision for the accumulating variables / buffers.

For CUDA and HIP, the kernels are templated on T_acc, which is set to real_xp in the launching function.

For OpenCL, I had to just typedef double real_xp; in math_kernel.cl and use real_xp directly.
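The templating on T_acc described above can be illustrated with a host-side C++ analogue. This is only a sketch, not the Neko kernel: `sum_acc` and `launch_sum` are hypothetical names, and `real_xp` is assumed to be `double`. The point it shows is that the data type T and the accumulator type T_acc are independent, and that the launching function is where T_acc gets fixed to real_xp.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: real_xp is double, as in the OpenCL typedef mentioned above.
typedef double real_xp;

// Host-side analogue of a reduction kernel templated on the data type T
// and a separate accumulator type T_acc (hypothetical, not Neko code).
template <typename T, typename T_acc>
T_acc sum_acc(const T* a, std::size_t n) {
  assert(a != nullptr || n == 0);
  T_acc acc = T_acc(0);
  for (std::size_t i = 0; i < n; ++i) {
    acc += static_cast<T_acc>(a[i]);  // widen each term before adding
  }
  return acc;
}

// Hypothetical "launching function": the caller only chooses T,
// the accumulator type is fixed to real_xp here.
template <typename T>
real_xp launch_sum(const T* a, std::size_t n) {
  return sum_acc<T, real_xp>(a, n);
}
```

For the input `{16777216.0f, 1.0f, 1.0f, 1.0f, 1.0f}`, `sum_acc<float, float>` returns 16777216 because 2^24 + 1 is not representable in single precision and each addition rounds back down, whereas `launch_sum` returns 16777220.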

njansson · 2026-05-20T16:28:50Z

I would suggest that instead of using real_xp directly, we use an additional template variable in addition to T

timofeymukha · 2026-05-21T07:47:58Z

> I would suggest that instead of using real_xp directly, we use an additional template variable in addition to T

Darn, that's what AI wanted to do, and I told it not to :-)). Alrighty!

njansson · 2026-05-21T08:55:55Z

This will unfortunately break on Apple silicon, where only single precision is supported.

How about we ifdef that typedef and retain real_xp as real if we are on macOS?

Copilot

Pull request overview

This PR updates the device-backend global reduction routines (glsum, glsc2, glsc3, glsubnorm, glsc3_many) to accumulate and reduce in extended precision (xp) on the device and across ranks/GPUs, then cast results back to the working precision (rp) at the API boundary. This improves numerical robustness (especially when rp is single precision) by reducing round-off during large reductions.

Changes:

- Updated device_math to use xp temporaries and MPI_EXTRA_PRECISION for host-side MPI reductions, with final casts back to rp.
- Updated CUDA/HIP/OpenCL device wrappers and kernels so the intermediate reduction buffers use real_xp and the kernels support an accumulator type (T_acc).
- Added/extended backend support utilities for extra-precision reduction buffers and inter-GPU/rank reductions in CUDA/HIP.
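The flow summarized above (partial sums held in an xp buffer, a final reduction in xp, one cast back to rp at the API boundary) can be sketched on the host as follows. This is an illustration under stated assumptions, not the PR's implementation: `block_partials` and `glsum_sketch` are hypothetical names, `real` is assumed to be `float`, `real_xp` is assumed to be `double`, and the MPI reduction across ranks is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef float real;      // assumed working precision (rp)
typedef double real_xp;  // assumed extended precision (xp)

// Stage 1: each "block" writes one partial sum into an xp buffer,
// mimicking the intermediate reduction buffer of a device kernel.
std::vector<real_xp> block_partials(const std::vector<real>& a,
                                    std::size_t block_size) {
  assert(block_size > 0);
  std::vector<real_xp> buf;
  for (std::size_t start = 0; start < a.size(); start += block_size) {
    const std::size_t end =
        (start + block_size < a.size()) ? start + block_size : a.size();
    real_xp acc = 0;
    for (std::size_t i = start; i < end; ++i) {
      acc += static_cast<real_xp>(a[i]);
    }
    buf.push_back(acc);
  }
  return buf;
}

// Stage 2: reduce the partials in xp and cast to rp exactly once.
real glsum_sketch(const std::vector<real>& a, std::size_t block_size) {
  if (a.empty()) return real(0);  // guard for empty input
  const std::vector<real_xp> buf = block_partials(a, block_size);
  real_xp total = 0;
  for (real_xp p : buf) total += p;
  // A reduction across ranks in xp would happen here; omitted in this sketch.
  return static_cast<real>(total);
}
```

With nine entries, the first equal to 2^24 and the rest equal to 1, and a block size of 4, the xp partials are 16777219, 4 and 1, and the returned rp value is 16777224. A pure single-precision accumulation of the same data would stay at 16777216.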

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/math/math.f90 | Minor docstring correction for `vlsc2` formula description. |
| src/math/bcknd/device/device_math.F90 | Switches device reductions to accumulate in `xp` and MPI-reduce with `MPI_EXTRA_PRECISION`, then converts back to `rp`. |
| src/math/bcknd/device/cuda/cuda_math.f90 | Updates CUDA Fortran interfaces to return/pass `c_xp` for `gl*` reductions. |
| src/math/bcknd/device/cuda/math.cu | Implements extra-precision reduction buffers and global reduction helper; updates CUDA `gl*` wrappers to use `real_xp`. |
| src/math/bcknd/device/cuda/math_kernel.h | Adds accumulator-type templates for CUDA reduction kernels and introduces `vlsc3_kernel` vs `glsc3_kernel`. |
| src/math/bcknd/device/hip/hip_math.f90 | Updates HIP Fortran interfaces to return/pass `c_xp` for `gl*` reductions. |
| src/math/bcknd/device/hip/math.hip | Implements extra-precision reduction buffers and global reduction helper; updates HIP `gl*` wrappers to use `real_xp`. |
| src/math/bcknd/device/hip/math_kernel.h | Adds accumulator-type templates for HIP reduction kernels and introduces `vlsc3_kernel` vs `glsc3_kernel`. |
| src/math/bcknd/device/opencl/opencl_math.f90 | Updates OpenCL Fortran interfaces to return/pass `c_xp` for `gl*` reductions. |
| src/math/bcknd/device/opencl/math.c | Updates OpenCL `gl*` wrappers to use `real_xp` buffers and return `real_xp`. |
| src/math/bcknd/device/opencl/math_kernel.cl | Updates OpenCL reduction kernels to write `real_xp` intermediates and accumulate with `real_xp`. |
| CHANGELOG.md | Documents the move to extended precision reductions in `device_math`. |

njansson · 2026-05-21T08:58:26Z

This will unfortunately break on Apple silicon, where only single precision is supported.

How about we ifdef that typedef and retain real_xp as real if we are on macOS?

Something like:

#ifdef __APPLE__
typedef real real_xp;
#else
typedef double real_xp;
#endif

njansson · 2026-05-21T08:59:34Z

This will unfortunately break on Apple silicon, where only single precision is supported.

How about we ifdef that typedef and retain real_xp as real if we are on macOS?

Something like:

#ifdef __APPLE__
typedef real real_xp;
#else
typedef double real_xp;
#endif

Also, we do have a header in the device folder where this can be set.
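One consequence of the guard suggested above is that code around the reduction buffers should size and type them through real_xp rather than assuming double. The following host-side sketch illustrates that idea; it assumes `real` is `float`, and `xp_buffer_bytes` is a hypothetical helper, not a function from Neko.

```cpp
#include <cassert>
#include <cstddef>

typedef float real;  // assumed single-precision build

#ifdef __APPLE__
typedef real real_xp;    // fall back to working precision
#else
typedef double real_xp;  // extended precision elsewhere
#endif

// Holds on every platform: the accumulator is never narrower than the data.
static_assert(sizeof(real_xp) >= sizeof(real),
              "real_xp must be at least as wide as real");

// Hypothetical helper: bytes needed for a buffer of nb partial sums.
// Using sizeof(real_xp) keeps the allocation correct under either typedef.
inline std::size_t xp_buffer_bytes(std::size_t nb) {
  assert(nb > 0);
  return nb * sizeof(real_xp);
}
```

With the fallback active, the reduction gains no extra precision, but the same buffer and kernel code compiles unchanged.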

vbaconnet · 2026-06-15T08:36:05Z

Thank you for doing this important work! I have two comments/questions:

Why only device_math and not host math? :D
Would there be instances where we would not want to truncate the reduction from xp to rp? In other words, to keep the reduction result in xp? It depends on the use case I guess.

timofeymukha · 2026-06-15T08:44:19Z

> Thank you for doing this important work! I have two comments/questions:
>
> Why only device_math and not host math? :D
>
> Would there be instances where we would not want to truncate the reduction from xp to rp? In other words, to keep the reduction result in xp? It depends on the use case I guess.

On the host, it is already implemented! Not 100% sure about the second point, but typically I would say that if you will use the values in rp computations down the line, there is not a lot of point in retaining one scalar in xp. But there may be exceptions, I guess.

vbaconnet · 2026-06-15T09:52:18Z

> On the host, it is already implemented!

Indeed, my bad..

njansson · 2026-06-22T06:16:36Z

 */
 template< typename T >
-__global__ void glsc3_kernel(const T * a,
+__global__ void vlsc3_kernel(const T * a,


glsum cuda

20484eb

timofeymukha requested a review from njansson May 20, 2026 13:34

timofeymukha added 5 commits May 20, 2026 16:37

allocator

5fb9055

CUDA gl*

9280647

HIP + dedicated vlsc3

08eb5ef

Changelog

4adce79

OpenCL WIP

26ed13f

timofeymukha added 3 commits May 21, 2026 11:13

Add template paramter

d08db90

Use shared buffers in OpenCL

9d7ce1e

OpenCL

d538682

timofeymukha changed the title ~~Extended precision for reduction operation in device_math (CUDA and HIP)~~ Extended precision for reduction operation in device_math May 21, 2026

timofeymukha requested a review from Copilot May 21, 2026 08:49

timofeymukha added the enhancement New feature or request label May 21, 2026

Copilot started reviewing on behalf of timofeymukha May 21, 2026 08:50

timofeymukha marked this pull request as ready for review May 21, 2026 08:52

Copilot AI reviewed May 21, 2026

timofeymukha added 3 commits May 21, 2026 13:09

Changelog and kernel release

c94d184

APPLE guard

406d6c8

add n<=0 guards

242b051

timofeymukha requested a review from timfelle May 21, 2026 10:30

timfelle reviewed May 21, 2026

Comment thread src/math/bcknd/device/opencl/math_kernel.cl Outdated

timfelle and others added 3 commits May 28, 2026 16:05

add apple guard

cfdf5fc

add substitution of xp

8bd34aa

Fix opencl

8b5e98a

timfelle approved these changes Jun 1, 2026

Merge branch 'develop' into feature/xp_device_reductions

7313075

njansson added this to Neko v1.1.0 release Jun 11, 2026

github-project-automation Bot moved this to 📋 Todo in Neko v1.1.0 release Jun 11, 2026

njansson enabled auto-merge June 15, 2026 14:11

timofeymukha added 2 commits June 15, 2026 17:12

Merge branch 'develop' into feature/xp_device_reductions

8794b3f

Merge branch 'develop' into feature/xp_device_reductions

5688c03

timfelle moved this from 📋 Todo to 🏗 In progress in Neko v1.1.0 release Jun 16, 2026

njansson reviewed Jul 1, 2026
