ci: add paired-release validation gate workflow (AITER+ATOM matrix) #3578
Open
sunway513 wants to merge 4 commits into
Multi-model GSM8K gate that dispatches against a paired AITER+ATOM
container image. Runs each of {dsr1, minimax, qwen3, glm5, kimi} on its
own DO MI350X 8-GPU runner in parallel. Output convention:
MODEL_FLEX_EXTRACT=<value> for each job, gated on per-model threshold.
Designed for the AITER+ATOM bi-weekly paired-release pilot.
First test target: rocm/atom-dev:atom0.1.4-rc0-aiter0.1.15-rc0.
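The output convention above can be sketched as a small gate script. This is only an illustration, not the workflow's actual step: the helper names `extract_score` and `passes_gate`, the log file layout, and the threshold passed in are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, log layout and thresholds are hypothetical.

# Print the value of the last MODEL_FLEX_EXTRACT=<value> line in a log file.
extract_score() {
  sed -n 's/^MODEL_FLEX_EXTRACT=//p' "$1" | tail -n 1
}

# Return 0 when score >= threshold, 1 otherwise (or when the score is empty).
# awk does the comparison because [ ... ] cannot compare floating-point values.
passes_gate() {
  local score="$1" threshold="$2"
  [ -n "$score" ] || return 1
  awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s + 0 >= t + 0) }'
}
```

A job would then call something like `passes_gate "$(extract_score eval.log)" "$THRESHOLD" || exit 1`, where `eval.log` and `THRESHOLD` are placeholders for whatever the workflow actually uses.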
…del in job-level if)
DO MI350X runners do not yet have /models populated for the paired-release
matrix (only model_path layout assumed by ATOM repos). Without this fallback
4/5 model jobs fail with HFValidationError on first dispatch.
The job now:
- checks if ${MODEL_PATH} exists in the container
- if yes: uses it as before (-v /models:/models still mounted)
- if no: strips the /models/ prefix and uses the HF repo id, requiring
HF_TOKEN to be set (already passed via secrets.HF_TOKEN_TEST)
First-pull cost is the HF download time. Subsequent runs on the same runner
benefit from HF cache.
Refs: paired-release-gate pilot post-mortem, task #79 (DO runner debug v0.2).
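The fallback described in this commit message can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the code in the PR: the function name `resolve_model` is hypothetical, and treating the presence of `config.json` as the marker of a usable local checkpoint is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: resolve_model is a hypothetical helper, not part of the workflow.

# Print the model reference to serve:
#   - the local path, when it contains a config.json (assumed marker of a
#     usable checkpoint);
#   - otherwise the HF repo id obtained by stripping the /models/ prefix,
#     which requires HF_TOKEN to be set.
resolve_model() {
  local model_path="$1"
  if [ -e "${model_path}/config.json" ]; then
    printf '%s\n' "$model_path"
    return 0
  fi
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN must be set to pull ${model_path#/models/} from HuggingFace" >&2
    return 1
  fi
  # /models/org/name -> org/name
  printf '%s\n' "${model_path#/models/}"
}
```

Failing early when `HF_TOKEN` is empty is a design choice of this sketch; it surfaces a missing secret before the server attempts a download.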
Pull request overview
Adds a new manual GitHub Actions workflow to validate paired AITER+ATOM release candidate containers by running a 5-model GSM8K accuracy gate in parallel on dedicated MI350X runners, then summarizing results.
Changes:
- Introduces `.github/workflows/paired-release-gate.yaml` (`workflow_dispatch`) to run one job per model in a matrix on `linux-aiter-do-mi350x-8`.
- Runs an ATOM OpenAI-compatible server in the container and evaluates `lm_eval gsm8k --num_fewshot 3`, gating against per-model thresholds.
- Adds a HuggingFace fallback path when a local `/models` cache entry is missing (but currently has a correctness bug in the fallback condition).
Comment on lines +22 to +25
    models:
      description: 'Comma-separated subset (or "all")'
      required: false
      default: 'all'
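One way a job step could interpret this `models` input is sketched below. The helper name `is_selected` and the decision to ignore spaces around commas are assumptions made for illustration; the PR does not show how the workflow itself parses the value.

```shell
#!/usr/bin/env bash
# Sketch only: is_selected is a hypothetical helper.

# Return 0 when MODEL should run for the given comma-separated `models` input.
# "all" selects every model; spaces around commas are ignored.
is_selected() {
  local models model="$2"
  models=$(printf '%s' "$1" | tr -d ' ')
  [ "$models" = "all" ] && return 0
  # Wrap both sides in commas so that "mini" does not match "minimax".
  case ",${models}," in
    *",${model},"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A step could use it as `is_selected "$MODELS" "$MODEL" || { echo "skipped"; exit 0; }`, with `MODELS` and `MODEL` standing in for the input and the matrix entry.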
Comment on lines +27 to +31
    concurrency:
      group: ${{ github.workflow }}-${{ inputs.container }}
      cancel-in-progress: false

    jobs:
Comment on lines +88 to +103
    if [ -f "/etc/podinfo/gha-render-devices" ]; then
      DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
    else
      DEVICE_FLAG="--device /dev/dri"
    fi
    docker run -dt \
      --device=/dev/kfd $DEVICE_FLAG \
      --ipc=host --group-add video --shm-size=64G --privileged \
      --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
      --ulimit memlock=-1 --ulimit stack=67108864 \
      --network=host \
      ${{ matrix.env_vars }} \
      -e HF_TOKEN="${HF_TOKEN:-${{ secrets.HF_TOKEN_TEST }}}" \
      -v /models:/models \
      --name ${INSTANCE} \
      --entrypoint=sleep ${CONTAINER_IMAGE} infinity
Comment on lines +121 to +123
    EFFECTIVE_MODEL='${MODEL_PATH}'
    if [ ! -d \"${MODEL_PATH}\" ] && [ ! -e \"${MODEL_PATH}/config.json\" ]; then
      HF_REPO=\$(echo '${MODEL_PATH}' | sed 's|^/models/||')
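The review overview flags a correctness bug in this fallback condition without spelling it out. One candidate, offered here only as an assumption, is that the `&&` form never triggers the fallback for a directory that exists but holds no `config.json`. The sketch below compares the condition as written with a hypothetical alternative; both function names are invented for illustration, and the alternative is not presented as the PR's eventual fix.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and exist to compare conditions.

# Condition as written in the diff: fall back only when the directory is
# missing AND config.json is missing.
needs_fallback_as_written() {
  [ ! -d "$1" ] && [ ! -e "$1/config.json" ]
}

# Hypothetical alternative: fall back whenever config.json is absent, which
# also covers a directory that exists but is empty.
needs_fallback_alt() {
  [ ! -e "$1/config.json" ]
}

# Demonstration on an existing but empty directory.
empty_dir=$(mktemp -d)
if needs_fallback_as_written "$empty_dir"; then
  echo "as written: fallback"
else
  echo "as written: no fallback"
fi
if needs_fallback_alt "$empty_dir"; then
  echo "alternative: fallback"
else
  echo "alternative: no fallback"
fi
rmdir "$empty_dir"
```

For a path that does not exist at all, and for a directory that does contain `config.json`, the two conditions agree.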
Summary
Adds `.github/workflows/paired-release-gate.yaml`, a `workflow_dispatch` validation gate for the AITER+ATOM bi-weekly paired release pilot (cadence proposal, pilot underway with AITER v0.1.15-rc0 + ATOM v0.1.4-rc0).

Each dispatch:
- targets a `rocm/atom-dev:atomX.Y-rcN-aiterX.Y-rcN` container
- runs one job per model on `linux-aiter-do-mi350x-8` runners (DO MI350X pool, 6 nodes x 8 GPUs)
- evaluates `lm_eval gsm8k --num_fewshot 3`

Wall time: ~20 min for 5-model gate vs ~50-60 min serial on a single node.

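Because the gate is keyed on the paired tag convention `atomX.Y-rcN-aiterX.Y-rcN`, a dispatch could sanity-check the `container` input before starting any job. The following is a minimal sketch under that assumption; `parse_paired_tag` is a hypothetical helper and is not part of the workflow.

```shell
#!/usr/bin/env bash
# Sketch only: parse_paired_tag is a hypothetical helper.

# Split a paired image reference into its ATOM and AITER versions.
# Returns 1 when the tag does not follow the atom<ver>-aiter<ver> convention.
parse_paired_tag() {
  local tag atom aiter
  tag="${1##*:}"                 # drop the repository part before the colon
  case "$tag" in
    atom*-aiter*) ;;
    *) echo "not a paired tag: $tag" >&2; return 1 ;;
  esac
  atom="${tag#atom}"             # 0.1.4-rc0-aiter0.1.15-rc0
  atom="${atom%%-aiter*}"        # 0.1.4-rc0
  aiter="${tag##*-aiter}"        # 0.1.15-rc0
  printf 'ATOM=%s AITER=%s\n' "$atom" "$aiter"
}

parse_paired_tag "rocm/atom-dev:atom0.1.4-rc0-aiter0.1.15-rc0"
# prints: ATOM=0.1.4-rc0 AITER=0.1.15-rc0
```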
Container contract
Image must ship: ATOM source + AITER wheel + matching `triton/flydsl`. Workflow does NOT install AITER from source.

Verified container for first run: `rocm/atom-dev:atom0.1.4-rc0-aiter0.1.15-rc0` (pushed to Docker Hub 2026-06-05).

Commit history on this branch
- `55db09a9`: initial workflow scaffold
- `181a6980`: yaml fix (drop unsupported `matrix.model` in job-level `if`)
- `3f120feb`: override container entrypoint to `sleep` (image baked `entrypoint=bash`)
- `728fbb65`: HuggingFace fallback when `/models` cache missing on DO runners (this addresses the most common first-dispatch failure mode)

Known limitations (tracked for v0.2)
First end-to-end dispatch (run 27068415810, before commit `728fbb65`) surfaced three categories of failure on DO runners:
1. Missing `/models` cache (4/5 jobs): DO runners did not have `/models/{repo}/{model}` populated. Fixed in this PR via HF fallback (commit `728fbb65`). Requires `secrets.HF_TOKEN_TEST` to be configured on the repo.
2. `hipErrorIllegalState` on container start: needs runner-side GPU init investigation. Tracked separately.
3. `store->get('0')` 600s wait: multi-process bootstrap config on DO runners. Tracked separately.

(2) and (3) require ATOM team + DO runner ops coordination and are not in scope for this PR.

Test plan
- YAML parse check (`python -c "import yaml; yaml.safe_load(...)"`)
- `models=skip` style not implemented; dispatch on a known-good container expected to be the first real validation
- Dispatch against `rocm/atom-dev:atom0.1.4-rc0-aiter0.1.15-rc0` once `HF_TOKEN_TEST` secret is in place, confirm at least one model passes end-to-end

Why merge now
The scaffold + HF fallback is enough to be the starting point for the v0.1.5 / v0.1.16 paired-release cycle (~2026-06-22). Iterations (Kimi GPU init, DSR1 RCCL) land as follow-up PRs. Keeping the scaffold on `main` lets ATOM team + RE team converge on the same gate definition.

cc @valarLip @plehnert @ROCmRichardLi