Minor hardcoded values with device built-in `warpSize` by abagusetty · Pull Request #784 · pyscf/gpu4pyscf

abagusetty · 2026-06-12T16:58:09Z

Replace the hardcoded values with device built-ins
Fix a test to replace cupy.linalg to numpy.linalg

Pure CUDA refactor that substitutes hardcoded warp-size assumptions in device-side kernels with the CUDA/HIP `warpSize` device-runtime built-in. No HIP-specific code is introduced; all changes remain correct on NVIDIA (warpSize=32) and become portable to architectures where the warp/wavefront size differs (e.g. AMD CDNA gfx9xx with wavefront=64). This is a prerequisite for the companion `hip` branch (a HIP/ROCm build) to produce correct results on wavefront-64 AMD GPUs. The two branches are intentionally independent so the warpSize cleanup can land on master without coupling to the HIP build system. Device-side substitutions (warpSize is a runtime constant; safe in __global__ / __device__ contexts): - gpu4pyscf/lib/ecp/common.cu: classic block_reduce pattern -- shared-memory reduce loop bound (s >= 32 -> s >= warpSize), warp-leader gate (tid < 32 -> tid < warpSize), and the intra-warp __shfl_down_sync sequence (16/8/4/2/1 unrolled -> for offset = warpSize/2; offset > 0; offset >>= 1). - gpu4pyscf/lib/gvhf-md/{unrolled_md_j.cu, unrolled_md_j_4dm.cu, md_contract_j.cu}: 25 occurrences of `lane_id = X % 32` -> `lane_id = X % warpSize`. - gpu4pyscf/lib/multigrid/multigrid_v2/evaluation.cuh: warp-stride reduction loop (_o = 16 -> _o = warpSize/2), warp-leader test (tid & 31 -> tid % warpSize), warp index (tid >> 5 -> tid / warpSize). Centralized the constexpr n_warps divisor into a named constant. Host-side / constexpr substitutions (warpSize is device-only so a named constant is used instead): - gpu4pyscf/lib/gvhf-md/md_j_driver.cu: qd_offset_for_threads 32-element alignment now uses a named WARP_SIZE_HOST constant derived from the assumed warp width. - gpu4pyscf/lib/pbc/{int3c2e_create_tasks.cuh, int3c2e_create_tasks_o1.cuh, int3c2e.cuh, ft_ao.cuh, ft_ao.cu}, gpu4pyscf/lib/multigrid/multigrid.cuh: existing `#define WARP_SIZE 32` directives wrapped in `#ifndef WARP_SIZE` so the build system can override the value (e.g. -DWARP_SIZE=64) for future wider-wavefront targets. All changes are zero-cost on NVIDIA and unlock correct portability.

Abhishek Bagusetty added 3 commits June 11, 2026 21:29

Replace some of the hardcoded values with macros

68bc751

fix a test case with cupy.linalg -> numpy.linalg

3fb415f

abagusetty marked this pull request as ready for review June 12, 2026 17:00

Replace a np.int32 initialization with device-resident cp.zeros

e64378c

sunqm merged commit 4a27d6d into pyscf:master Jun 13, 2026
5 of 6 checks passed

abagusetty deleted the hip-warpsize branch June 13, 2026 17:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minor hardcoded values with device built-in `warpSize`#784

Minor hardcoded values with device built-in `warpSize`#784
sunqm merged 4 commits into
pyscf:masterfrom
abagusetty:hip-warpsize

abagusetty commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

abagusetty commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants