
[CI] add mi300 pipeline#669

Open
chaoos wants to merge 81 commits into master from feature/cicd-mi300

Conversation

@chaoos
Contributor

@chaoos chaoos commented Mar 9, 2026

This PR adds requirements on the side of tmLQCD for CI/CD testing on the CSCS test system "beverin". This system hosts AMD MI300A GPUs.

The pipeline was changed as follows:

Additionally, one can comment on any PR with

cscs-ci run beverin

to run the pipeline on the MI300 nodes at CSCS. The test is equivalent to the GH200 pipeline test.

Both pipelines (GH200 and MI300) were changed to have three stages:

  • prepare: builds the base image with all dependencies in it
  • build: builds tmLQCD for the PR, and QUDA from its newest head commit (this ALWAYS rebuilds QUDA and tmLQCD on every invocation of the pipeline, bypassing all build caches)
  • test: runs the HMC test as before

Furthermore, both trigger comments can be supplemented with variables that propagate to the pipeline jobs. For instance:

cscs-ci run beverin;VARIABLE=value

Available variables to set are:

  • QUDA_GIT_REPO: the git repository URL to use as source for the QUDA spack build in the build stage (defaults to https://github.com/lattice/quda.git)
  • QUDA_GIT_BRANCH: the git branch (defaults to develop)
  • QUDA_GIT_COMMIT: the git commit (defaults to the current head commit of QUDA_GIT_BRANCH)
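The defaulting behaviour these variables describe could be sketched in the pipeline's shell as follows. This is a minimal sketch: the variable names come from the description above, but the expansion idiom is an assumption about how the CI jobs might implement it, not taken from the actual configuration.

```shell
# Hedged sketch: apply the documented defaults when the PR comment
# does not override them (implementation detail assumed).
QUDA_GIT_REPO="${QUDA_GIT_REPO:-https://github.com/lattice/quda.git}"
QUDA_GIT_BRANCH="${QUDA_GIT_BRANCH:-develop}"
# With no explicit commit, the pipeline would resolve the branch head,
# e.g. via: git ls-remote "$QUDA_GIT_REPO" "refs/heads/$QUDA_GIT_BRANCH"
echo "repo=$QUDA_GIT_REPO branch=$QUDA_GIT_BRANCH commit=${QUDA_GIT_COMMIT:-branch-head}"
```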

This functionality exists to allow testing the whole pipeline against a QUDA branch that is under active development (possibly on a fork). For instance:

cscs-ci run beverin;QUDA_GIT_BRANCH=feature/prefetch2

will pull the most recent commit of QUDA's feature/prefetch2 branch and compile and run against that one, or

cscs-ci run beverin;QUDA_GIT_COMMIT=c9308c9a20cd7a68f8f45f20c7141b83dbc7f44a

will check out a specific commit hash instead of the most recent one.

This works on both the GH200 and the MI300 pipelines, but not on the GitHub Actions pipelines.

TODO:

  • minimal dependency list in environment.yaml.
  • make quda@develop work in the spack spec even though develop is an evolving target.
  • merge CMake support #664 into feature/cicd-mi300
  • merge CMake support #664 into master
  • adapt to the CMake build
  • Remove echo "VARIABLE = $VARIABLE" and other noise in all yaml files
  • Adjust GH200 pipeline to match MI300A pipeline
  • Add memory usage recording, see [CI] add mi300 pipeline #669 (comment)

@chaoos
Contributor Author

chaoos commented Mar 9, 2026

cscs-ci run beverin

@chaoos
Contributor Author

chaoos commented Mar 9, 2026

This currently fails due to missing access to the beverin test system at CSCS.

@chaoos
Contributor Author

chaoos commented Mar 10, 2026

A manual build using quda branch feature/prefetch2 did work, see https://cicd-ext-mw.cscs.ch/ci/pipeline/results/3690753405420143/64239695/2375574724?iid=1711

I used this command on beverin:

uenv build .ci/uenv-recipes/tmlqcd/beverin-mi300 tmlqcd/quda-prefetch2@beverin%mi300

with this spack spec for quda:

  specs:
  - "quda@git.feature/prefetch2 +qdp +multigrid +twisted_clover +twisted_mass"
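To pin an exact commit rather than a branch head (mirroring the QUDA_GIT_COMMIT pipeline variable), Spack's git-version syntax should also accept a commit hash. This is a sketch based on Spack's `@git.<commit>=<version>` convention; the exact spec string is an assumption, not taken from this recipe:

```yaml
  specs:
  # pinned commit (hash from the example earlier in this PR) instead of a branch head
  - "quda@git.c9308c9a20cd7a68f8f45f20c7141b83dbc7f44a=develop +qdp +multigrid +twisted_clover +twisted_mass"
```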

The image is available in the service namespace of the CSCS uenv registry:

$ uenv image find service::
uenv                                       arch   system   id                size(MB)  date
tmlqcd/quda-prefetch2:2375574724           mi300  beverin  8f0acefe49988d34   3,857    2026-03-10

But the gcc compiler in the uenv is broken 😵:

$ uenv start tmlqcd --view=default
$ gcc --version
Illegal instruction (core dumped)

@mtaillefumier
Contributor

I learned that the mi300 cluster uses a different authentication mechanism; that is why it is failing.

@mtaillefumier
Contributor

cscs-ci run beverin

@mtaillefumier
Contributor

$ uenv image find service::
uenv                                       arch   system   id                size(MB)  date
tmlqcd/quda-prefetch2:2375574724           mi300  beverin  8f0acefe49988d34   3,857    2026-03-10

But the gcc compiler in the uenv is broken 😵:

$ uenv start tmlqcd --view=default
$ gcc --version
Illegal instruction (core dumped)

it is a sign that the code was compiled on mi300 and executed on mi250 nodes. Doing the reverse would work.


@chaoos
Contributor Author

chaoos commented Mar 11, 2026

it is a sign that the code was compiled on mi300 and executed on mi250 nodes. Doing the reverse would work.

I see, that makes sense. Apparently I started the uenv on the login node, which has MI250s. I have now started it on a compute node with MI300s, and the compiler works. The next problem is that configure's C compiler check fails:

configure:2981: $? = 1
configure:3001: checking whether the C compiler works
configure:3023: /user-environment/env/default/bin/mpicc -O3 -fopenmp -mtune=neoverse-v2 -mcpu=neoverse-v2  -fopenmp conftest.c  >&5
gcc: warning: '-mcpu=' is deprecated; use '-mtune=' or '-march=' instead
cc1: error: bad value 'neoverse-v2' for '-mtune=' switch
cc1: note: valid arguments to '-mtune=' switch are: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids emeraldrapids alderlake raptorlake meteorlake graniterapids graniterapids-d arrowlake arrowlake-s lunarlake pantherlake bonnell atom silvermont slm goldmont goldmont-plus tremont gracemont sierraforest grandridge clearwaterforest knl knm intel x86-64 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 lujiazui yongfeng k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 znver4 znver5 btver1 btver2 generic native

It seems to me that through the spack process and uenv packaging, the compiler uses the neoverse flags for the GH200 nodes instead of -march=znver4 -mtune=znver4 for the MI300 CPUs. I have to see where this gets injected.

@chaoos
Contributor Author

chaoos commented Mar 11, 2026

cscs-ci run beverin

Add F7T_CLIENT_ID and F7T_CLIENT_SECRET variables for build stage.
@chaoos
Contributor Author

chaoos commented Mar 11, 2026

cscs-ci run beverin

@chaoos
Contributor Author

chaoos commented Mar 11, 2026

cscs-ci run default

@mtaillefumier
Contributor

cscs-ci run default

@mtaillefumier
Contributor

cscs-ci run beverin

@mtaillefumier
Contributor

cscs-ci run beverin

@mtaillefumier
Contributor

cscs-ci run beverin

@mtaillefumier
Contributor

cscs-ci run beverin

@chaoos
Contributor Author

chaoos commented Mar 11, 2026

Status update: compiling on the mi300 node works, but running the code fails.

Allocate an mi300 node on beverin:

salloc --nodes=1 --time=01:00:00 --partition=mi300 --gpus-per-node=4

Interactive shell on the compute node for compilation:

srun --uenv=tmlqcd --view=default --pty bash

Compile tmlqcd on the compute node against quda in the uenv:

export CFLAGS="-O3 -fopenmp -mtune=znver4 -mcpu=znver4"
export CXXFLAGS="-O3 -fopenmp -mtune=znver4 -mcpu=znver4"
export LDFLAGS="-fopenmp"
export CC="$(which mpicc)"
export CXX="$(which mpicxx)"
mkdir -p install_dir
autoconf
./configure \
  --enable-quda_experimental \
  --enable-mpi \
  --enable-omp \
  --with-mpidimension=4 \
  --enable-alignment=32 \
  --with-qudadir="/user-environment/env/default" \
  --with-limedir="/user-environment/env/default" \
  --with-lemondir="/user-environment/env/default" \
  --with-lapack="-lopenblas -L/user-environment/env/default/lib" \
  --with-hipdir="/user-environment/env/default/lib" \
  --prefix="$(pwd)/install_dir"
make
make install

Run on the login node:

srun --uenv=tmlqcd --view=default -n 4 ./install_dir/bin/hmc_tm -f doc/sample-input/sample-hmc-quda-cscs-beverin.input

fails with:

# QUDA: ERROR: hipStreamCreateWithPriority(&streams[i], hipStreamDefault, greatestPriority) returned out of memory
 (/tmp/anfink/spack-stage/spack-stage-quda-git.feature_prefetch2_1.0.0-git.7857-e4a5b7x7tczfsshsilrkl2hrmqvgqkam/spack-src/lib/targets/hip/device.cpp:116 in create_context())
 (rank 3, host nid002920, quda_api.cpp:60 in void quda::target::hip::set_runtime_error(hipError_t, const char *, const char *, const char *, const char *, bool)())
# QUDA:        last kernel called was (name=,volume=,aux=)
# QUDA:        last tune param used was block=(64,1,1), grid=(1,1,1), shared_bytes=0, shared_carve_out=0, aux=(1,1,1,1)

The mapping of the 4 processes to the 4 GPUs is correct.
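One thing worth ruling out for the out-of-memory in hipStreamCreateWithPriority is whether every rank creates contexts on all four GPUs even though the rank-to-GPU mapping looks correct. A hedged sketch of a per-rank binding wrapper (not taken from this PR; ROCR_VISIBLE_DEVICES and SLURM_LOCALID are standard ROCm and Slurm variables) that restricts each rank to a single device:

```shell
# Hedged sketch: bind each MPI rank to one GCD so rank i only sees GPU i.
# SLURM_LOCALID is set by srun on the real system; it is simulated below.
cat > select_gpu.sh <<'EOF'
#!/bin/sh
export ROCR_VISIBLE_DEVICES="${SLURM_LOCALID}"
exec "$@"
EOF
chmod +x select_gpu.sh
# On beverin this would be used as (hypothetical invocation):
#   srun --uenv=tmlqcd --view=default -n 4 ./select_gpu.sh ./install_dir/bin/hmc_tm -f <input>
SLURM_LOCALID=2 ./select_gpu.sh sh -c 'echo "rank sees GPU $ROCR_VISIBLE_DEVICES"'
```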

@mtaillefumier
Contributor

cscs-ci run beverin

@chaoos
Contributor Author

chaoos commented Mar 11, 2026

For future reference some information about the mi300:
rocminfo.txt
rocm-smi.txt
lscpu.txt
numactl.txt

@mtaillefumier
Contributor

@chaoos : The CI/CD is set up properly. Only I can configure it correctly, due to some restrictions on our side.

@chaoos
Contributor Author

chaoos commented Mar 11, 2026

@chaoos : The CI/CD is set up properly. Only I can configure it correctly, due to some restrictions on our side.

I see, shall we zoom briefly?

@kostrzewa
Member

@mtaillefumier could you please let me know which exact QUDA commit was compiled here? I currently can't compile the develop head commit on LUMI-G (with rocm-6.3.4 or rocm-6.4.4) and have to resort to working with the feature/prefetch2 branch, which, however, seems to introduce severe performance regressions.

@chaoos
Contributor Author

chaoos commented Mar 11, 2026

@kostrzewa The CI now compiles against the develop branch, which will fail. The test I made above is against the feature/prefetch2 branch, which did compile (hip@6.3.3).

I cannot say anything about performance since I cannot run.

@chaoos
Contributor Author

chaoos commented Apr 1, 2026

cscs-ci run beverin;VARIABLE=test4

@chaoos
Contributor Author

chaoos commented Apr 1, 2026

cscs-ci run beverin;VARIABLE=test4

@chaoos
Contributor Author

chaoos commented Apr 1, 2026

cscs-ci run beverin;VARIABLE=test4

@chaoos
Contributor Author

chaoos commented Apr 1, 2026

cscs-ci run beverin;VARIABLE=test4

@chaoos
Contributor Author

chaoos commented Apr 2, 2026

cscs-ci run beverin;VARIABLE=test5

@chaoos
Contributor Author

chaoos commented Apr 2, 2026

cscs-ci run beverin

@chaoos
Contributor Author

chaoos commented Apr 2, 2026

cscs-ci run default

@chaoos
Contributor Author

chaoos commented Apr 2, 2026

cscs-ci run default

@chaoos
Contributor Author

chaoos commented Apr 2, 2026

cscs-ci run default
cscs-ci run beverin

@chaoos
Contributor Author

chaoos commented Apr 2, 2026

cscs-ci run beverin

@chaoos
Contributor Author

chaoos commented Apr 2, 2026

OK, this PR is now in a state where things work, and it is ready for review. The beverin pipeline only fails because the comparison is too restrictive.
For the things added/changed by this PR, please see the updated description above.

@kostrzewa Do you want me to add the memory monitoring (#669 (comment)) to the CI/CD? I'm not sure if you meant that ...

@kostrzewa
Member

kostrzewa commented Apr 4, 2026

@chaoos, sorry, comment #669 (comment) was meant for @mtaillefumier when testing "manually" (i.e., not through the CI pipeline) on CSCS MI250 or MI300 hardware, to check whether the memory leak that I observe on LUMI-G also occurs on the AMD GPU machines at CSCS.

@kostrzewa
Member

@chaoos sorry, I completely missed how detailed #669 (comment) was. I'll try to take a look next week, but I'm not sure that I will manage. Feel free to merge this as soon as you are happy with it.

As for the comparison being too restrictive: we might need to follow the same strategy that we followed for the DDalphaAMG workflow, that is, increase the solver precision significantly, regenerate the reference data, and then also run the pipelines with the stricter solver precisions.
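A looser comparison ultimately boils down to a relative-tolerance check rather than an exact match. A minimal sketch (the values and the tolerance below are made up for illustration and are not from the actual test harness):

```shell
# Hedged sketch: accept a measured value if it is within a relative
# tolerance of the reference, instead of requiring exact agreement.
ref=0.61234567   # hypothetical reference observable (e.g. a plaquette value)
val=0.61234570   # hypothetical measured value from the pipeline run
awk -v r="$ref" -v v="$val" -v tol=1e-6 'BEGIN {
  d = v - r; if (d < 0) d = -d;          # absolute difference
  rel = d / (r < 0 ? -r : r);            # relative difference
  exit !(rel < tol);                     # exit 0 iff within tolerance
}' && echo "within tolerance"
```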

Very, very nice to have the HMC under automatic testing on both NVIDIA and AMD GPUs.
