Skip to content

[CI] Add hip-tests kernel suite in SPIRV mode#223

Open
lamb-j wants to merge 25 commits into
amd-stagingfrom
users/lambj/spirv-ci-hip-tests
Open

[CI] Add hip-tests kernel suite in SPIRV mode#223
lamb-j wants to merge 25 commits into
amd-stagingfrom
users/lambj/spirv-ci-hip-tests

Conversation

@lamb-j

@lamb-j lamb-j commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds a test_hip_tests job that builds the hip-tests kernel catch2 unit suite with -DENABLE_SPIRV=ON and runs it via ctest on a gfx942 GPU runner. Exercises the SPIRV translator through HIP's own test suite — complements the rocm-examples coverage with structured catch2 test cases.

Prototype scope: just the kernel unit dir to start. Expand to more unit dirs (memory, stream, device, ...) once the SPIR-V baseline is characterized.

Why hip-tests is a clean fit

hip-tests has first-class SPIR-V support: -DENABLE_SPIRV=ON makes its own CMake:

  • set --offload-arch=amdgcnspirv
  • gate out cases that can't run in SPIR-V mode (per-test if(NOT ENABLE_SPIRV) blocks)

catch2's catch_discover_tests registers each case with ctest, so the flow is just: configure → cmake --build --target KernelTestctest -R Kernel. No hand-rolled bash loop.

Reuse from the rocm-examples PR

This shares the in-pipeline runtime build (rocr-runtime + CLR + Comgr → staging/), the manylinux container with GPU passthrough, and the lib/lib64 + libgcc runtime-lib flags from #208.

Stacking

Based on users/lambj/spirv-ci-rocm-examples (#208) since it inherits the Build-job changes. Will rebase onto amd-staging once #208 lands. Review #208 first.

Pin

  • ROCm/hip-tests: d01e1f96 (develop, 2026-06-03)

Status

Informational initially. Once the kernel baseline is green (or known failures characterized), expand unit-dir coverage and consider gating.

lamb-j added 24 commits June 4, 2026 14:00
Adds a test_rocm_examples job that compiles ROCm/rocm-examples
HIP-Basic/hello_world via PR clang with `--offload-arch=amdgcnspirv`
and runs the resulting binary on a gfx942 runner. This exercises the
SPIRV translator end-to-end through real HIP compile + JIT kernel
load, catching translator/codegen bugs that pass lit but fail at
runtime.

Build job changes:
- After Comgr build, install comgr + device-libs into a staging/ tree
  via cmake --install. clang's --rocm-path expects installed-tree
  layout, not build-dir layout.
- Clone ROCm/rocm-systems (pinned) and build rocr-runtime + clr
  (libhsa-runtime64.so + libamdhip64.so) against the just-built clang,
  installing into the same staging/ tree.
- Clone ROCm/rocm-cmake (pinned) for the ROCM cmake modules CLR needs.
- Disable kpack, rocprofiler-register, and host-hipcc detection in CLR
  (they're optional and not part of the SPIRV CI scope).
- Strip and tar steps extended to include staging/.

Why in-pipeline runtime build (vs prebuilt nightly): TheRock's public
nightly-tarball bucket only publishes consumer (dgpu) families on a
daily cadence; gfx94X / gfx950 (dcgpu) tarballs stopped Oct 2025, and
TheRock's own multi-arch CI for those families builds the runtime
in-pipeline for the same reason. Building in-pipeline also ensures the
runtime tracks the same llvm/clang amd-staging tip we're testing.

Test job:
- runs-on: linux-gfx942-1gpu-ossci-rocm (matches existing TheRock
  multi-arch CI runner choice).
- Compiles hello_world via PR clang + --rocm-path=$STAGING; explicit
  -lhsa-runtime64 needed since it's not pulled transitively from
  libamdhip64.so during exe link.
- Runs binary with LD_LIBRARY_PATH=$STAGING/lib.
- Greps for fixed host + device greeting lines.
- Informational initially; promote to required once stable.

Pinned SHAs (bump deliberately, ~weekly):
- ROCm/rocm-systems: 8bb3b73c (develop, 2026-05-15)
- ROCm/rocm-cmake:   4d391d7b (develop, 2026-05-12)
- ROCm/rocm-examples: b4ee9992 (develop, 2025-02-21)

Estimated wall-time impact: existing build ~14 min + ~7-10 min new
(rocr ~2 min, clr ~5-8 min) -> ~22-25 min total. Inside 120-min
timeout. Test job adds one GPU-runner job, runs ~1-2 min.
ROCR-Runtime's image library compiles OpenCL "blit" bitcode kernels
for ~30 gfx targets via add_custom_command(... clang ...). Those
clang invocations don't get --rocm-path forwarded; in the manylinux
build container there's no /opt/rocm default to fall back to, so
device-libs aren't found. Setting ROCM_PATH=$PWD/staging points
clang's auto-detection at the staging tree where device-libs were
installed in the previous step.
clang-linker-wrapper does its own library lookup before invoking the
host linker; it doesn't pick up -L flags from the clang driver, so
-lamdhip64 fails to resolve on the GPU runner where there's no
system ROCm. Set LIBRARY_PATH=$STAGING/lib so the wrapper finds it
in the same tree the host linker is searching.

Add -Wl,-rpath,$STAGING/lib so the binary can dlopen the runtime
libs without depending on LD_LIBRARY_PATH being set at exec time.
Helps diagnose why clang-linker-wrapper can't find -lamdhip64 even
with LIBRARY_PATH set. Temporary; revert once we know the fix.
CMake's GNUInstallDirs picks lib64 on the manylinux2014 container
(64-bit Red Hat heritage). clang's --rocm-path detection looks for
libraries at <prefix>/lib, so libamdhip64.so + libhsa-runtime64.so
landed where the HIP driver couldn't see them.

Set CMAKE_INSTALL_LIBDIR=lib on both ROCR-Runtime and CLR configures
to force the layout clang expects. Also drop the debug step and
LIBRARY_PATH workaround now that the real cause is identified.
Helps determine whether the kernel launches and whether sync returns
cleanly when the device printf output is missing. Temporary.
HIP runtime calls Comgr at hipLaunchKernel time to translate the
embedded SPIR-V fatbin to AMDGPU code for the actual GPU. Comgr
searches for the translator at <libamd_comgr.so>/../bin/amd-llvm-spirv.

The translator binary is built in build/bin/ but wasn't being copied
into staging/, so Comgr couldn't find it at JIT time, hipLaunchKernel
returned hipErrorInvalidImage, and the kernel silently never ran.
hipDeviceSynchronize then returned success (nothing to sync) so the
binary exited cleanly with only the host-side print output.

Copy amd-llvm-spirv + the llvm-spirv symlink into staging/bin/ after
the CLR install step.
Comgr looks for amd-llvm-spirv by name; the llvm-spirv symlink is
for downstream tools that expect the upstream name and isn't needed
for the HIP runtime JIT path.
Comgr's 'Failed to compile spirv to reloc' is opaque — the
underlying error stays buried unless logs are redirected to stderr.
Also list staging/bin to confirm the translator is actually there
at runtime.
AMD_COMGR_REDIRECT_LOGS alone surfaced no extra detail for the
'Failed to compile spirv to reloc' failure. Add:
- ROCM_PATH=$STAGING so Comgr finds device-libs in staging instead
  of falling back to /opt/rocm
- AMD_COMGR_SAVE_TEMPS=/tmp/comgr to dump intermediate files showing
  how far the SPIR-V → reloc pipeline got
- Post-run listing of /tmp/comgr/ contents
Match the working invocation from ROCm/RocmCIForSPIRV
spirv_external_llvm.sh. The critical addition is --offload-new-driver,
which routes through the HIP offload codepath that produces a fatbin
HIP runtime can JIT for amdgcnspirv. Without it the runtime hits
"Failed to compile spirv to reloc" because the bundle format doesn't
match what hip_fatbin.cpp expects.

Also added:
- --rocm-device-lib-path explicit (auto-detect doesn't always find the
  staged amdgcn/bitcode dir)
- -D__HIP_PLATFORM_AMD__ to match
- -Wl,--unresolved-symbols=ignore-in-shared-libs (libamd_comgr embeds
  its own LLVM symbols)
HIP runtime reaches Comgr and Comgr's COMPILE_SPIRV_TO_RELOCATABLE
action fails. Need to verify what's actually in the binary's fatbin
section — SPIR-V with the right target string, or something else
(LLVM IR? Wrong target name?). Add a step that lists bundle entries
via clang-offload-bundler.
Global env var forces Comgr to log its internal actions without HIP
needing to enable per-action logging. Targets the silent
AMD_COMGR_ACTION_COMPILE_SPIRV_TO_RELOCATABLE failure at JIT time.
The gfx942 ossci-rocm runner is persistent (no container), so state
from prior runs can leak in. Two things to investigate:

- ~/.cache/comgr / ~/.amd/comgr — Comgr's JIT cache. A cached failed
  compile from a prior run could reproduce the same failure even
  after we've fixed the inputs. RocmCIForSPIRV's clean_all explicitly
  wipes these for the same reason.
- /opt/rocm — if a system ROCm install is present, HIP runtime could
  be silently resolving against it instead of (or alongside) our
  staged tree.

Add a probe step that lists both, then wipes the Comgr caches before
the Run step.
Probe step now shows which libamdhip64 + libamd_comgr are actually
resolving at runtime, and the rpath/runpath / NEEDED entries of our
staged libs. /opt/rocm-6.4.1 exists on the runner — if the staged
libs end up pulling Comgr from there, our COMPILE_SPIRV_TO_RELOCATABLE
call lands in a stale Comgr API.
The dynamic loader was resolving libamd_comgr.so.3 from
/opt/rocm-6.4.1/lib (a system ROCm install on the GPU runner) instead
of from our staged tree, because Comgr's install rules apparently
don't always emit the libamd_comgr.so.<MAJOR> SONAME symlink in
staging/lib. Without that symlink, LD_LIBRARY_PATH=staging/lib has
nothing to match against and falls through to ld.so.cache, which
hands back the old system Comgr.

End result: our libamdhip64 (which calls the new
AMD_COMGR_ACTION_COMPILE_SPIRV_TO_RELOCATABLE) ends up linked against
6.4.1's libamd_comgr that doesn't recognize that action — JIT fails
with INVALID_ARGUMENT.

Backfill the SONAME + dev symlinks after install, and log the lib
directory before/after so we can confirm what's there.
Comgr's install was landing in staging/lib64 because the manylinux
container's GNUInstallDirs defaults to lib64. Even though the symlink
chain (libamd_comgr.so / .so.3 / .so.3.3.0) was created, it was in
the wrong directory — LD_LIBRARY_PATH=staging/lib had no comgr to
match, so the runtime resolver fell through to /opt/rocm-6.4.1's
older Comgr (with a stale API that doesn't recognize
COMPILE_SPIRV_TO_RELOCATABLE).

Add -DCMAKE_INSTALL_LIBDIR=lib to Comgr's configure (matching what
we already did for ROCR + CLR). Revert the buggy SONAME backfill —
no longer needed once Comgr installs to the right place.
Replace the single hello-world compile+run with loop-based steps
covering the same SPIR-V-eligible example sets that
ROCm/RocmCIForSPIRV's spirv_external_llvm.sh builds:

- SIMPLE_EXAMPLES (16): HIP-Basic single-source examples
- APPLICATIONS (6): HIP-Basic + External/ include path
- REDUCTION (9): Tutorials/reduction v1-v9 (c++20)

Excluded examples (opengl/vulkan interop, hello_world_cuda, hipify,
assembly_to_executable, llvm_ir_to_executable, sobel_filter) either
need libraries we don't have or use native-AMDGPU code paths
incompatible with SPIR-V.

Verification: gate on exit code 0 per example (catches crashes,
hipError returns, asserts). 60s timeout per example to avoid hangs.
Per-example pass/fail reported in step output.

Bumped rocm-examples pin from b4ee9992 (develop, 2025-02) to e260595e
(amd-staging, 2026-05-28) — develop hadn't moved since Feb, while
amd-staging tracks the active examples we want to test against.

Bumped job timeout 20m -> 45m to fit 31 examples (~30s build each
+ ~few seconds run each + GPU runner queue time).
Brings rocm-examples coverage to 34 — full parity with what
RocmCIForSPIRV's spirv_external_llvm.sh builds, modulo
multi_gpu_data_transfer (needs >1 GPU; this runner is 1gpu) and
the math-libs-dependent tier (rocPRIM, hipCUB, monte_carlo_pi).

bandwidth + matrix_multiplication live in HIP-Basic but need
-I External (new EXTERNAL_EXAMPLES bucket). runtime_compilation
needs -lhiprtc at link time; staged as a one-off.
- Bump actions/upload-artifact v4 -> v7, download-artifact v4 -> v8
  (reviewer note: v4 is being deprecated June 2; v7/v8 are current).
- Remove the two diagnostic steps from test_rocm_examples that we
  added while debugging the JIT failure:
  * "Dump fatbin contents" (llvm-objdump / clang-offload-bundler
    inspection of hello_world's fatbin)
  * "Probe runner state + clean caches" (ldd, readelf, /opt/rocm
    listing, comgr cache wipe)
  These caught the lib64 vs lib mismatch and the libamd_comgr SONAME
  fallthrough to /opt/rocm; both are fixed now (CMAKE_INSTALL_LIBDIR=lib
  on Comgr install), so the probes have served their purpose.
- Drop the small "ls staging/lib/libamd_comgr*" diagnostic at the end
  of the install step too.

Net -34 lines.
per reviewer suggestion (idubinov). built_examples.txt only contains
names of executables that succeeded compile (BUILD_OK append happens
after a successful compile), so the [ -x "./$exe" ] check was
redundant — trust the list.
Incorporates the good ideas from idubinov's PR #216:

Build job:
- CMAKE_INSTALL_LIBDIR=lib on LLVM Configure (manylinux defaults to
  lib64; downstream CMake projects expect lib).
- LLVM_ENABLE_RUNTIMES=compiler-rt + COMPILER_RT_BUILD_BUILTINS=ON
  (all other compiler-rt features off). hip-lang-config.cmake
  hard-requires libclang_rt.builtins.a for any downstream
  enable_language(HIP).
- builtins added to the ninja build target.
- CMAKE_INSTALL_LIBDIR=lib also added to device-libs Configure.

Test job:
- Now runs inside the same manylinux container as Build with GPU
  passthrough (--device=/dev/kfd /dev/dri --group-add video). Isolates
  from /opt/rocm-* and Comgr cache on the persistent host runner;
  also gives glibc consistency with the built libs.
- dnf install numactl-libs (HIP runtime dep).

HIP-Basic now built via rocm-examples' own CMakeLists (23 examples)
instead of our hand-rolled bash loop (19 examples). The upstream
CMake auto-skips SPIR-V-incompatible examples (assembly_to_executable,
llvm_ir_to_executable, opengl/vulkan_interop, hello_world_cuda,
hipify, sobel_filter) via find_package / GPU_RUNTIME gates. Picks up
3 examples we previously excluded: hip_static_host_library (+ _cxx),
hip_module_api, hip_multi_gpu_data_transfer.

Applications + Tutorials/reduction stay on the bash loop — their
CMakeLists pull in hipcub/rocrand/rocfft find_packages that aren't
present in our staged tree.

Run step now iterates both examples-build/bin/* (CMake build) and
built_examples.txt (bash loop), with hip_cooperative_groups skipped
(known s.wait.asynccnt codegen bug). Total coverage: ~32 examples
(22 HIP-Basic + 6 Applications + 9 Tutorials/reduction - 1 skipped).

Dropped --offload-new-driver (now clang default) and -use-spirv-backend
(stick with clang default translator codepath) from Igor's flag set.
It's a real codegen failure ('Cannot select intrinsic
%llvm.amdgcn.s.wait.asynccnt') and the whole point of the
test_rocm_examples job is to keep surfacing this until the AMDGPU
backend gets a lowering pattern. Gating the suite on it is the
correct behavior.
Adds a test_hip_tests job that builds the hip-tests `kernel` catch2
unit suite with -DENABLE_SPIRV=ON and runs it via ctest on a gfx942
GPU runner. Prototype scope: just the kernel unit dir to start;
expand to more unit dirs once the SPIR-V baseline is characterized.

hip-tests has first-class SPIR-V support: -DENABLE_SPIRV=ON makes its
own CMake set --offload-arch=amdgcnspirv and gate out cases that can't
run in SPIR-V mode (per-test `if(NOT ENABLE_SPIRV)` blocks). catch2's
catch_discover_tests registers each case with ctest, so we just
configure -> build KernelTest -> ctest -R Kernel.

Reuses the in-pipeline runtime staging tree + manylinux container +
GPU passthrough from the rocm-examples job. Same lib/lib64 + libgcc
runtime-lib flags.

Stacked on the rocm-examples branch (shares the Build job). Will
rebase onto amd-staging once #208 lands. Informational initially.
@lamb-j lamb-j force-pushed the users/lambj/spirv-ci-hip-tests branch from 6333f0d to e19cf07 Compare June 4, 2026 21:00
@lamb-j lamb-j changed the base branch from users/lambj/spirv-ci-rocm-examples to amd-staging June 4, 2026 21:00
@lamb-j lamb-j closed this Jun 4, 2026
@lamb-j lamb-j reopened this Jun 4, 2026
hip-tests' project(hiptests LANGUAGES C CXX HIP) compiles host code
with the default system compiler (gcc-toolset-13 in the manylinux
container). We pass --rtlib=libgcc -unwindlib=libgcc in CMAKE_CXX_FLAGS,
which gcc doesn't understand ("unrecognized command-line option
--rtlib=libgcc"), so the project() compiler sanity check fails.

Point CMAKE_C_COMPILER + CMAKE_CXX_COMPILER at our staged clang so
those flags are valid for host code too (matches what the HIP language
already uses via CMAKE_HIP_COMPILER).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant