Skip to content

Minor hardcoded values with device built-in warpSize#784

Merged
sunqm merged 4 commits into
pyscf:masterfrom
abagusetty:hip-warpsize
Jun 13, 2026
Merged

Minor hardcoded values with device built-in warpSize#784
sunqm merged 4 commits into
pyscf:masterfrom
abagusetty:hip-warpsize

Conversation

@abagusetty

Copy link
Copy Markdown
Contributor
  1. Replace the hardcoded values with device built-ins
  2. Fix a test to replace cupy.linalg to numpy.linalg

Abhishek Bagusetty added 3 commits June 11, 2026 21:29
Pure CUDA refactor that substitutes hardcoded warp-size assumptions
in device-side kernels with the CUDA/HIP `warpSize` device-runtime
built-in. No HIP-specific code is introduced; all changes remain
correct on NVIDIA (warpSize=32) and become portable to architectures
where the warp/wavefront size differs (e.g. AMD CDNA gfx9xx with
wavefront=64).

This is a prerequisite for the companion `hip` branch (a HIP/ROCm
build) to produce correct results on wavefront-64 AMD GPUs. The two
branches are intentionally independent so the warpSize cleanup can
land on master without coupling to the HIP build system.

Device-side substitutions (warpSize is a runtime constant; safe in
__global__ / __device__ contexts):
- gpu4pyscf/lib/ecp/common.cu: classic block_reduce pattern --
  shared-memory reduce loop bound (s >= 32 -> s >= warpSize),
  warp-leader gate (tid < 32 -> tid < warpSize), and the
  intra-warp __shfl_down_sync sequence (16/8/4/2/1 unrolled ->
  for offset = warpSize/2; offset > 0; offset >>= 1).
- gpu4pyscf/lib/gvhf-md/{unrolled_md_j.cu, unrolled_md_j_4dm.cu,
  md_contract_j.cu}: 25 occurrences of `lane_id = X % 32` ->
  `lane_id = X % warpSize`.
- gpu4pyscf/lib/multigrid/multigrid_v2/evaluation.cuh: warp-stride
  reduction loop (_o = 16 -> _o = warpSize/2), warp-leader test
  (tid & 31 -> tid % warpSize), warp index (tid >> 5 ->
  tid / warpSize). Centralized the constexpr n_warps divisor
  into a named constant.

Host-side / constexpr substitutions (warpSize is device-only so a
named constant is used instead):
- gpu4pyscf/lib/gvhf-md/md_j_driver.cu: qd_offset_for_threads
  32-element alignment now uses a named WARP_SIZE_HOST constant
  derived from the assumed warp width.
- gpu4pyscf/lib/pbc/{int3c2e_create_tasks.cuh,
  int3c2e_create_tasks_o1.cuh, int3c2e.cuh, ft_ao.cuh, ft_ao.cu},
  gpu4pyscf/lib/multigrid/multigrid.cuh: existing
  `#define WARP_SIZE 32` directives wrapped in `#ifndef WARP_SIZE`
  so the build system can override the value (e.g. -DWARP_SIZE=64)
  for future wider-wavefront targets.

All changes are zero-cost on NVIDIA and unlock correct portability.
@abagusetty abagusetty marked this pull request as ready for review June 12, 2026 17:00
@sunqm sunqm merged commit 4a27d6d into pyscf:master Jun 13, 2026
5 of 6 checks passed
@abagusetty abagusetty deleted the hip-warpsize branch June 13, 2026 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants