[CI] Add hip-tests kernel suite in SPIRV mode#223
Open
lamb-j wants to merge 25 commits into
Open
Conversation
Adds a test_rocm_examples job that compiles ROCm/rocm-examples HIP-Basic/hello_world via PR clang with `--offload-arch=amdgcnspirv` and runs the resulting binary on a gfx942 runner. This exercises the SPIRV translator end-to-end through real HIP compile + JIT kernel load, catching translator/codegen bugs that pass lit but fail at runtime. Build job changes: - After Comgr build, install comgr + device-libs into a staging/ tree via cmake --install. clang's --rocm-path expects installed-tree layout, not build-dir layout. - Clone ROCm/rocm-systems (pinned) and build rocr-runtime + clr (libhsa-runtime64.so + libamdhip64.so) against the just-built clang, installing into the same staging/ tree. - Clone ROCm/rocm-cmake (pinned) for the ROCM cmake modules CLR needs. - Disable kpack, rocprofiler-register, and host-hipcc detection in CLR (they're optional and not part of the SPIRV CI scope). - Strip and tar steps extended to include staging/. Why in-pipeline runtime build (vs prebuilt nightly): TheRock's public nightly-tarball bucket only publishes consumer (dgpu) families on a daily cadence; gfx94X / gfx950 (dcgpu) tarballs stopped Oct 2025, and TheRock's own multi-arch CI for those families builds the runtime in-pipeline for the same reason. Building in-pipeline also ensures the runtime tracks the same llvm/clang amd-staging tip we're testing. Test job: - runs-on: linux-gfx942-1gpu-ossci-rocm (matches existing TheRock multi-arch CI runner choice). - Compiles hello_world via PR clang + --rocm-path=$STAGING; explicit -lhsa-runtime64 needed since it's not pulled transitively from libamdhip64.so during exe link. - Runs binary with LD_LIBRARY_PATH=$STAGING/lib. - Greps for fixed host + device greeting lines. - Informational initially; promote to required once stable. Pinned SHAs (bump deliberately, ~weekly): - ROCm/rocm-systems: 8bb3b73c (develop, 2026-05-15) - ROCm/rocm-cmake: 4d391d7b (develop, 2026-05-12) - ROCm/rocm-examples: b4ee9992 (develop, 2025-02-21) Estimated wall-time impact: existing build ~14 min + ~7-10 min new (rocr ~2 min, clr ~5-8 min) -> ~22-25 min total. Inside 120-min timeout. Test job adds one GPU-runner job, runs ~1-2 min.
ROCR-Runtime's image library compiles OpenCL "blit" bitcode kernels for ~30 gfx targets via add_custom_command(... clang ...). Those clang invocations don't get --rocm-path forwarded; in the manylinux build container there's no /opt/rocm default to fall back to, so device-libs aren't found. Setting ROCM_PATH=$PWD/staging points clang's auto-detection at the staging tree where device-libs were installed in the previous step.
clang-linker-wrapper does its own library lookup before invoking the host linker; it doesn't pick up -L flags from the clang driver, so -lamdhip64 fails to resolve on the GPU runner where there's no system ROCm. Set LIBRARY_PATH=$STAGING/lib so the wrapper finds it in the same tree the host linker is searching. Add -Wl,-rpath,$STAGING/lib so the binary can dlopen the runtime libs without depending on LD_LIBRARY_PATH being set at exec time.
Helps diagnose why clang-linker-wrapper can't find -lamdhip64 even with LIBRARY_PATH set. Temporary; revert once we know the fix.
CMake's GNUInstallDirs picks lib64 on the manylinux2014 container (64-bit Red Hat heritage). clang's --rocm-path detection looks for libraries at <prefix>/lib, so libamdhip64.so + libhsa-runtime64.so landed where the HIP driver couldn't see them. Set CMAKE_INSTALL_LIBDIR=lib on both ROCR-Runtime and CLR configures to force the layout clang expects. Also drop the debug step and LIBRARY_PATH workaround now that the real cause is identified.
Helps determine whether the kernel launches and whether sync returns cleanly when the device printf output is missing. Temporary.
HIP runtime calls Comgr at hipLaunchKernel time to translate the embedded SPIR-V fatbin to AMDGPU code for the actual GPU. Comgr searches for the translator at <libamd_comgr.so>/../bin/amd-llvm-spirv. The translator binary is built in build/bin/ but wasn't being copied into staging/, so Comgr couldn't find it at JIT time, hipLaunchKernel returned hipErrorInvalidImage, and the kernel silently never ran. hipDeviceSynchronize then returned success (nothing to sync) so the binary exited cleanly with only the host-side print output. Copy amd-llvm-spirv + the llvm-spirv symlink into staging/bin/ after the CLR install step.
Comgr looks for amd-llvm-spirv by name; the llvm-spirv symlink is for downstream tools that expect the upstream name and isn't needed for the HIP runtime JIT path.
Comgr's 'Failed to compile spirv to reloc' is opaque — the underlying error stays buried unless logs are redirected to stderr. Also list staging/bin to confirm the translator is actually there at runtime.
AMD_COMGR_REDIRECT_LOGS alone surfaced no extra detail for the 'Failed to compile spirv to reloc' failure. Add: - ROCM_PATH=$STAGING so Comgr finds device-libs in staging instead of falling back to /opt/rocm - AMD_COMGR_SAVE_TEMPS=/tmp/comgr to dump intermediate files showing how far the SPIR-V → reloc pipeline got - Post-run listing of /tmp/comgr/ contents
Match the working invocation from ROCm/RocmCIForSPIRV spirv_external_llvm.sh. The critical addition is --offload-new-driver, which routes through the HIP offload codepath that produces a fatbin HIP runtime can JIT for amdgcnspirv. Without it the runtime hits "Failed to compile spirv to reloc" because the bundle format doesn't match what hip_fatbin.cpp expects. Also added: - --rocm-device-lib-path explicit (auto-detect doesn't always find the staged amdgcn/bitcode dir) - -D__HIP_PLATFORM_AMD__ to match - -Wl,--unresolved-symbols=ignore-in-shared-libs (libamd_comgr embeds its own LLVM symbols)
HIP runtime reaches Comgr and Comgr's COMPILE_SPIRV_TO_RELOCATABLE action fails. Need to verify what's actually in the binary's fatbin section — SPIR-V with the right target string, or something else (LLVM IR? Wrong target name?). Add a step that lists bundle entries via clang-offload-bundler.
Global env var forces Comgr to log its internal actions without HIP needing to enable per-action logging. Targets the silent AMD_COMGR_ACTION_COMPILE_SPIRV_TO_RELOCATABLE failure at JIT time.
The gfx942 ossci-rocm runner is persistent (no container), so state from prior runs can leak in. Two things to investigate: - ~/.cache/comgr / ~/.amd/comgr — Comgr's JIT cache. A cached failed compile from a prior run could reproduce the same failure even after we've fixed the inputs. RocmCIForSPIRV's clean_all explicitly wipes these for the same reason. - /opt/rocm — if a system ROCm install is present, HIP runtime could be silently resolving against it instead of (or alongside) our staged tree. Add a probe step that lists both, then wipes the Comgr caches before the Run step.
Probe step now shows which libamdhip64 + libamd_comgr are actually resolving at runtime, and the rpath/runpath / NEEDED entries of our staged libs. /opt/rocm-6.4.1 exists on the runner — if the staged libs end up pulling Comgr from there, our COMPILE_SPIRV_TO_RELOCATABLE call lands in a stale Comgr API.
The dynamic loader was resolving libamd_comgr.so.3 from /opt/rocm-6.4.1/lib (a system ROCm install on the GPU runner) instead of from our staged tree, because Comgr's install rules apparently don't always emit the libamd_comgr.so.<MAJOR> SONAME symlink in staging/lib. Without that symlink, LD_LIBRARY_PATH=staging/lib has nothing to match against and falls through to ld.so.cache, which hands back the old system Comgr. End result: our libamdhip64 (which calls the new AMD_COMGR_ACTION_COMPILE_SPIRV_TO_RELOCATABLE) ends up linked against 6.4.1's libamd_comgr that doesn't recognize that action — JIT fails with INVALID_ARGUMENT. Backfill the SONAME + dev symlinks after install, and log the lib directory before/after so we can confirm what's there.
Comgr's install was landing in staging/lib64 because the manylinux container's GNUInstallDirs defaults to lib64. Even though the symlink chain (libamd_comgr.so / .so.3 / .so.3.3.0) was created, it was in the wrong directory — LD_LIBRARY_PATH=staging/lib had no comgr to match, so the runtime resolver fell through to /opt/rocm-6.4.1's older Comgr (with a stale API that doesn't recognize COMPILE_SPIRV_TO_RELOCATABLE). Add -DCMAKE_INSTALL_LIBDIR=lib to Comgr's configure (matching what we already did for ROCR + CLR). Revert the buggy SONAME backfill — no longer needed once Comgr installs to the right place.
Replace the single hello-world compile+run with loop-based steps covering the same SPIR-V-eligible example sets that ROCm/RocmCIForSPIRV's spirv_external_llvm.sh builds: - SIMPLE_EXAMPLES (16): HIP-Basic single-source examples - APPLICATIONS (6): HIP-Basic + External/ include path - REDUCTION (9): Tutorials/reduction v1-v9 (c++20) Excluded examples (opengl/vulkan interop, hello_world_cuda, hipify, assembly_to_executable, llvm_ir_to_executable, sobel_filter) either need libraries we don't have or use native-AMDGPU code paths incompatible with SPIR-V. Verification: gate on exit code 0 per example (catches crashes, hipError returns, asserts). 60s timeout per example to avoid hangs. Per-example pass/fail reported in step output. Bumped rocm-examples pin from b4ee9992 (develop, 2025-02) to e260595e (amd-staging, 2026-05-28) — develop hadn't moved since Feb, while amd-staging tracks the active examples we want to test against. Bumped job timeout 20m -> 45m to fit 31 examples (~30s build each + ~few seconds run each + GPU runner queue time).
Brings rocm-examples coverage to 34 — full parity with what RocmCIForSPIRV's spirv_external_llvm.sh builds, modulo multi_gpu_data_transfer (needs >1 GPU; this runner is 1gpu) and the math-libs-dependent tier (rocPRIM, hipCUB, monte_carlo_pi). bandwidth + matrix_multiplication live in HIP-Basic but need -I External (new EXTERNAL_EXAMPLES bucket). runtime_compilation needs -lhiprtc at link time; staged as a one-off.
- Bump actions/upload-artifact v4 -> v7, download-artifact v4 -> v8
(reviewer note: v4 is being deprecated June 2; v7/v8 are current).
- Remove the two diagnostic steps from test_rocm_examples that we
added while debugging the JIT failure:
* "Dump fatbin contents" (llvm-objdump / clang-offload-bundler
inspection of hello_world's fatbin)
* "Probe runner state + clean caches" (ldd, readelf, /opt/rocm
listing, comgr cache wipe)
These caught the lib64 vs lib mismatch and the libamd_comgr SONAME
fallthrough to /opt/rocm; both are fixed now (CMAKE_INSTALL_LIBDIR=lib
on Comgr install), so the probes have served their purpose.
- Drop the small "ls staging/lib/libamd_comgr*" diagnostic at the end
of the install step too.
Net -34 lines.
per reviewer suggestion (idubinov). built_examples.txt only contains names of executables that succeeded compile (BUILD_OK append happens after a successful compile), so the [ -x "./$exe" ] check was redundant — trust the list.
Incorporates the good ideas from idubinov's PR #216: Build job: - CMAKE_INSTALL_LIBDIR=lib on LLVM Configure (manylinux defaults to lib64; downstream CMake projects expect lib). - LLVM_ENABLE_RUNTIMES=compiler-rt + COMPILER_RT_BUILD_BUILTINS=ON (all other compiler-rt features off). hip-lang-config.cmake hard-requires libclang_rt.builtins.a for any downstream enable_language(HIP). - builtins added to the ninja build target. - CMAKE_INSTALL_LIBDIR=lib also added to device-libs Configure. Test job: - Now runs inside the same manylinux container as Build with GPU passthrough (--device=/dev/kfd /dev/dri --group-add video). Isolates from /opt/rocm-* and Comgr cache on the persistent host runner; also gives glibc consistency with the built libs. - dnf install numactl-libs (HIP runtime dep). HIP-Basic now built via rocm-examples' own CMakeLists (23 examples) instead of our hand-rolled bash loop (19 examples). The upstream CMake auto-skips SPIR-V-incompatible examples (assembly_to_executable, llvm_ir_to_executable, opengl/vulkan_interop, hello_world_cuda, hipify, sobel_filter) via find_package / GPU_RUNTIME gates. Picks up 3 examples we previously excluded: hip_static_host_library (+ _cxx), hip_module_api, hip_multi_gpu_data_transfer. Applications + Tutorials/reduction stay on the bash loop — their CMakeLists pull in hipcub/rocrand/rocfft find_packages that aren't present in our staged tree. Run step now iterates both examples-build/bin/* (CMake build) and built_examples.txt (bash loop), with hip_cooperative_groups skipped (known s.wait.asynccnt codegen bug). Total coverage: ~32 examples (22 HIP-Basic + 6 Applications + 9 Tutorials/reduction - 1 skipped). Dropped --offload-new-driver (now clang default) and -use-spirv-backend (stick with clang default translator codepath) from Igor's flag set.
It's a real codegen failure ('Cannot select intrinsic
%llvm.amdgcn.s.wait.asynccnt') and the whole point of the
test_rocm_examples job is to keep surfacing this until the AMDGPU
backend gets a lowering pattern. Gating the suite on it is the
correct behavior.
Adds a test_hip_tests job that builds the hip-tests `kernel` catch2 unit suite with -DENABLE_SPIRV=ON and runs it via ctest on a gfx942 GPU runner. Prototype scope: just the kernel unit dir to start; expand to more unit dirs once the SPIR-V baseline is characterized. hip-tests has first-class SPIR-V support: -DENABLE_SPIRV=ON makes its own CMake set --offload-arch=amdgcnspirv and gate out cases that can't run in SPIR-V mode (per-test `if(NOT ENABLE_SPIRV)` blocks). catch2's catch_discover_tests registers each case with ctest, so we just configure -> build KernelTest -> ctest -R Kernel. Reuses the in-pipeline runtime staging tree + manylinux container + GPU passthrough from the rocm-examples job. Same lib/lib64 + libgcc runtime-lib flags. Stacked on the rocm-examples branch (shares the Build job). Will rebase onto amd-staging once #208 lands. Informational initially.
6333f0d to
e19cf07
Compare
hip-tests' project(hiptests LANGUAGES C CXX HIP) compiles host code
with the default system compiler (gcc-toolset-13 in the manylinux
container). We pass --rtlib=libgcc -unwindlib=libgcc in CMAKE_CXX_FLAGS,
which gcc doesn't understand ("unrecognized command-line option
--rtlib=libgcc"), so the project() compiler sanity check fails.
Point CMAKE_C_COMPILER + CMAKE_CXX_COMPILER at our staged clang so
those flags are valid for host code too (matches what the HIP language
already uses via CMAKE_HIP_COMPILER).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a
test_hip_testsjob that builds the hip-testskernelcatch2 unit suite with-DENABLE_SPIRV=ONand runs it via ctest on a gfx942 GPU runner. Exercises the SPIRV translator through HIP's own test suite — complements the rocm-examples coverage with structured catch2 test cases.Prototype scope: just the
kernelunit dir to start. Expand to more unit dirs (memory, stream, device, ...) once the SPIR-V baseline is characterized.Why hip-tests is a clean fit
hip-tests has first-class SPIR-V support:
-DENABLE_SPIRV=ONmakes its own CMake:--offload-arch=amdgcnspirvif(NOT ENABLE_SPIRV)blocks)catch2's
catch_discover_testsregisters each case with ctest, so the flow is just: configure →cmake --build --target KernelTest→ctest -R Kernel. No hand-rolled bash loop.Reuse from the rocm-examples PR
This shares the in-pipeline runtime build (rocr-runtime + CLR + Comgr →
staging/), the manylinux container with GPU passthrough, and the lib/lib64 + libgcc runtime-lib flags from #208.Stacking
Based on
users/lambj/spirv-ci-rocm-examples(#208) since it inherits the Build-job changes. Will rebase ontoamd-stagingonce #208 lands. Review #208 first.Pin
ROCm/hip-tests:d01e1f96(develop, 2026-06-03)Status
Informational initially. Once the kernel baseline is green (or known failures characterized), expand unit-dir coverage and consider gating.