Conversation
|
cscs-ci run beverin
|
This currently fails due to missing access to the beverin test system at CSCS.
|
A manual build using the quda branch: I used this command on beverin:

```
uenv build .ci/uenv-recipes/tmlqcd/beverin-mi300 tmlqcd/quda-prefetch2@beverin%mi300
```

with this spack spec for quda. The image is available in the service namespace of CSCS's uenv registry:

```
$ uenv image find service::
uenv                              arch   system   id                size(MB)  date
tmlqcd/quda-prefetch2:2375574724  mi300  beverin  8f0acefe49988d34  3,857     2026-03-10
```

But the gcc compiler in the uenv is broken 😵:
|
I learned that the mi300 cluster uses a different authentication mechanism, which is why it is failing.
|
cscs-ci run beverin
it is a sign that the code was compiled on mi300 and executed on mi250 nodes. Doing the reverse would work.
I see, that makes sense. Apparently, I started the uenv on the login node, which has mi200s. I have now started it on a compute node with mi300s and the compiler seems to work. The next problem is that cmake's C compiler test fails: it seems to me that through the spack process and uenv packaging, the compiler uses the neoverse flags for the GH200 node instead of
|
cscs-ci run beverin
Add F7T_CLIENT_ID and F7T_CLIENT_SECRET variables for build stage.
|
cscs-ci run beverin
|
cscs-ci run default
|
cscs-ci run beverin
fix typo
|
cscs-ci run beverin
|
Status update: compiling on the mi300 node works, but running the code fails.

Allocate an mi300 node on beverin:

```
salloc --nodes=1 --time=01:00:00 --partition=mi300 --gpus-per-node=4
```

Interactive shell on the compute node for compilation:

```
srun --uenv=tmlqcd --view=default --pty bash
```

Compile tmlqcd on the compute node against quda in the uenv:

```
export CFLAGS="-O3 -fopenmp -mtune=znver4 -mcpu=znver4"
export CXXFLAGS="-O3 -fopenmp -mtune=znver4 -mcpu=znver4"
export LDFLAGS="-fopenmp"
export CC="$(which mpicc)"
export CXX="$(which mpicxx)"
mkdir -p install_dir
autoconf
./configure \
  --enable-quda_experimental \
  --enable-mpi \
  --enable-omp \
  --with-mpidimension=4 \
  --enable-alignment=32 \
  --with-qudadir="/user-environment/env/default" \
  --with-limedir="/user-environment/env/default" \
  --with-lemondir="/user-environment/env/default" \
  --with-lapack="-lopenblas -L/user-environment/env/default/lib" \
  --with-hipdir="/user-environment/env/default/lib" \
  --prefix="$(pwd)/install_dir"
make
make install
```

Run on the login node:

```
srun --uenv=tmlqcd --view=default -n 4 ./install_dir/bin/hmc_tm -f doc/sample-input/sample-hmc-quda-cscs-beverin.input
```

fails with:

The mapping of the 4 processes to the 4 GPUs is correct.
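As a side note on the last point, the 1:1 mapping of local MPI ranks to GPUs mentioned above can be illustrated with a pure-shell sketch. The loop below is an illustrative assumption, not the actual Slurm launcher logic: in a real job, `srun` sets `SLURM_LOCALID` per rank and the loop would not exist.

```shell
# Illustrative only: emulate 4 local ranks on a 4-GPU node and bind
# each rank i to GPU i via ROCR_VISIBLE_DEVICES (a real job would read
# SLURM_LOCALID from the environment set by srun instead of looping).
for SLURM_LOCALID in 0 1 2 3; do
  export ROCR_VISIBLE_DEVICES="$SLURM_LOCALID"
  echo "rank $SLURM_LOCALID -> GPU $ROCR_VISIBLE_DEVICES"
done
```

With this binding, each rank sees exactly one device, which matches the "4 processes to 4 GPUs" check above.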
|
cscs-ci run beverin
|
For future reference, some information about the mi300:
|
@chaoos : The CI/CD is properly set up. Only I can set it up correctly due to some restrictions on our side.
I see, shall we zoom briefly?
|
@mtaillefumier could you please let me know which exact QUDA commit was compiled here? I currently can't compile the develop head commit on Lumi-G (with rocm-6.3.4 or rocm-6.4.4) and have to resort to working with the feature/prefetch2 branch which, however, seems to introduce severe performance regressions.
|
@kostrzewa The CI now compiles against the …. I cannot say anything about performance, since I cannot run it.
|
cscs-ci run beverin;VARIABLE=test4
|
cscs-ci run beverin;VARIABLE=test4
|
cscs-ci run beverin;VARIABLE=test4
|
cscs-ci run beverin;VARIABLE=test4
|
cscs-ci run beverin;VARIABLE=test5
|
cscs-ci run beverin
|
cscs-ci run default
|
cscs-ci run default
|
cscs-ci run default
|
cscs-ci run beverin
|
OK, this PR is now in a state where things work and is ready for review. The beverin pipeline only fails because the comparison is too restrictive. @kostrzewa Do you want me to add the memory monitoring (#669 (comment)) to the CI/CD? I'm not sure if you meant that ...
|
@chaoos, sorry, comment #669 (comment) was meant for @mtaillefumier when testing "manually" (i.e., not through the CI pipeline) on CSCS MI250 or MI300 hardware, to check if the memory leak that I observe on LUMI-G also occurs on the AMD GPU machines at CSCS.
|
@chaoos sorry, I completely missed how detailed #669 (comment) was. I'll try to take a look next week, but I'm not sure that I will manage. Feel free to merge this in as soon as you are happy with it. As for the comparison being too restrictive: we might need to follow the same strategy that we followed for the DDalphaAMG workflow. That is: increase the solver precision significantly, regenerate the reference data and then also run the pipelines with stricter solver precisions. Very, very nice to have the HMC on both NVIDIA and AMD GPUs under automatic testing.
This PR adds requirements on the side of tmLQCD for CI/CD testing on the CSCS test system "beverin". This system hosts AMD MI300A GPUs.
The pipeline was changed as follows:
Additionally, one can comment on any PR with `cscs-ci run beverin` to run the pipeline on MI300 nodes at CSCS. The test is equivalent to the GH200 pipeline test.
Both pipelines (GH200 and MI300) were changed to have 3 stages now:
- `prepare`: builds the base image with all dependencies in it
- `build`: builds tmLQCD for the PR, and QUDA from its newest head commit (this ALWAYS rebuilds QUDA and tmLQCD for every invocation of the pipeline, bypassing all build caches)
- `test`: runs the HMC test as before

Furthermore, both comments can be supplemented by variables which propagate to the pipeline jobs. For instance:
Available variables to set are:

- `QUDA_GIT_REPO`: the git repository URL to use as source for the QUDA spack build in the `build` stage (defaults to `https://github.com/lattice/quda.git`)
- `QUDA_GIT_BRANCH`: the git branch (defaults to `develop`)
- `QUDA_GIT_COMMIT`: the git commit (defaults to the current head commit of `QUDA_GIT_BRANCH`)

This functionality is there to be able to test the whole pipeline against a certain QUDA branch that is active in development (possibly on a fork). For instance:
Supplying `QUDA_GIT_BRANCH=feature/prefetch2` will pull the most recent commit of the `feature/prefetch2` branch of QUDA instead, and compile and run against that one; setting `QUDA_GIT_COMMIT` will checkout a certain commit hash instead of the most recent one.
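The variable-passing mechanism above (`cscs-ci run <pipeline>;VAR=value;...` style comments, as used elsewhere in this thread) can be sketched in a few lines of shell. This parsing is an illustrative assumption about the comment format, not the actual cscs-ci implementation:

```shell
# Split a trigger comment "cscs-ci run <pipeline>;VAR1=a;VAR2=b" into
# the pipeline name and the variable assignments it carries.
comment='cscs-ci run beverin;QUDA_GIT_BRANCH=feature/prefetch2;VARIABLE=test4'
pipeline=$(printf '%s' "$comment" | cut -d';' -f1 | awk '{print $3}')
vars=$(printf '%s' "$comment" | cut -d';' -f2- | tr ';' '\n')
echo "pipeline=$pipeline"
echo "$vars"
```

For the comment above, this prints `pipeline=beverin` followed by one `VAR=value` assignment per line, which is the form in which the variables would be handed to the pipeline jobs.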
This works on the GH200 as well as on the MI300 pipeline, but not on the GitHub Actions pipelines.
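The defaults listed above could be resolved with standard shell parameter expansion; the snippet below is a sketch of the documented behaviour, not the pipeline's actual code:

```shell
# Apply the documented defaults when the trigger comment did not set them.
QUDA_GIT_REPO="${QUDA_GIT_REPO:-https://github.com/lattice/quda.git}"
QUDA_GIT_BRANCH="${QUDA_GIT_BRANCH:-develop}"
# QUDA_GIT_COMMIT defaults to the head of QUDA_GIT_BRANCH; resolving that
# needs network access (e.g. git ls-remote), so it is left unset here.
echo "repo=$QUDA_GIT_REPO branch=$QUDA_GIT_BRANCH"
```

A trigger comment that sets `QUDA_GIT_BRANCH` would simply pre-populate the variable, and the `:-` fallback would not fire.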
TODO:

- `environment.yaml` …
- make `quda@develop` work in the spack spec even though `develop` is an evolving target
- `feature/cicd-mi300` … `master` …
- `cmake` build …
- `echo "VARIABLE = $VARIABLE"` and other noise in all yaml files