ci: add paired-release validation gate workflow (AITER+ATOM matrix) by sunway513 · Pull Request #3578 · ROCm/aiter

sunway513 · 2026-06-06T19:00:46Z

Summary

Adds .github/workflows/paired-release-gate.yaml — a workflow_dispatch validation gate for the AITER+ATOM bi-weekly paired release pilot (cadence proposal, pilot underway with AITER v0.1.15-rc0 + ATOM v0.1.4-rc0).

Each dispatch:

pulls a paired rocm/atom-dev:atomX.Y-rcN-aiterX.Y-rcN container
spawns one job per model (DSR1, MiniMax, Qwen3, GLM-5, Kimi) in parallel on dedicated linux-aiter-do-mi350x-8 runners (DO MI350X pool, 6 nodes x 8 GPUs)
launches ATOM server + runs lm_eval gsm8k --num_fewshot 3
gates each model against its upstream-canonical GSM8K threshold

Wall time: ~20 min for 5-model gate vs ~50-60 min serial on a single node.

Container contract

Image must ship: ATOM source + AITER wheel + matching triton / flydsl. Workflow does NOT install AITER from source.

Verified container for first run: rocm/atom-dev:atom0.1.4-rc0-aiter0.1.15-rc0 (pushed to Docker Hub 2026-06-05).
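
The paired tag convention above (atomX.Y-rcN-aiterX.Y-rcN) can be split mechanically, which is handy for a sanity check before dispatch. The following is a minimal sketch, not part of the workflow; the helper name parse_paired_tag and its output format are hypothetical.

```shell
# Hypothetical helper (not in the PR): split a paired image tag such as
# rocm/atom-dev:atom0.1.4-rc0-aiter0.1.15-rc0 into its two version strings.
parse_paired_tag() {
    tag="${1##*:}"                  # drop everything up to the last ':'
    case "$tag" in
        atom*-aiter*) ;;            # expected paired layout
        *) echo "not a paired tag: $tag" >&2; return 1 ;;
    esac
    atom="${tag%%-aiter*}"          # text before "-aiter", e.g. atom0.1.4-rc0
    aiter="${tag#*-aiter}"          # text after "-aiter", e.g. 0.1.15-rc0
    echo "atom=${atom#atom} aiter=${aiter}"
}

parse_paired_tag "rocm/atom-dev:atom0.1.4-rc0-aiter0.1.15-rc0"
```

Rejecting tags that do not match the paired layout keeps a mistyped container input from reaching the runners.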

Commit history on this branch

55db09a9 — initial workflow scaffold
181a6980 — yaml fix (drop unsupported matrix.model in job-level if)
3f120feb — override container entrypoint to sleep (image baked entrypoint=bash)
728fbb65 — HuggingFace fallback when /models cache missing on DO runners (this addresses the most common first-dispatch failure mode)

Known limitations (tracked for v0.2)

First end-to-end dispatch (run 27068415810, before commit 728fbb65) surfaced three categories of failure on DO runners:

(1) Missing /models cache (4/5 jobs): DO runners did not have /models/{repo}/{model} populated. Fixed in this PR via HF fallback (commit 728fbb65). Requires secrets.HF_TOKEN_TEST to be configured on the repo.
(2) Kimi hipErrorIllegalState on container start: needs runner-side GPU init investigation. Tracked separately.
(3) DSR1 RCCL bootstrap timeout (store->get('0') 600s wait): multi-process bootstrap config on DO runners. Tracked separately.

(2) and (3) require ATOM team + DO runner ops coordination and are not in scope for this PR.

Test plan

YAML syntax validated (python -c "import yaml; yaml.safe_load(...)")
Workflow dispatch UI smoke test (a no-op mode such as models=skip is not implemented, so a dispatch on a known-good container is expected to be the first real validation)
Post-merge: dispatch on rocm/atom-dev:atom0.1.4-rc0-aiter0.1.15-rc0 once HF_TOKEN_TEST secret is in place, confirm at least one model passes end-to-end

Why merge now

The scaffold + HF fallback is enough to be the starting point for the v0.1.5 / v0.1.16 paired-release cycle (~2026-06-22). Iterations (Kimi GPU init, DSR1 RCCL) land as follow-up PRs. Keeping the scaffold on main lets ATOM team + RE team converge on the same gate definition.

cc @valarLip @plehnert @ROCmRichardLi

Multi-model GSM8K gate that dispatches against a paired AITER+ATOM container image. Runs each of {dsr1, minimax, qwen3, glm5, kimi} on its own DO MI350X 8-GPU runner in parallel. Output convention: MODEL_FLEX_EXTRACT=<value> for each job, gated on per-model threshold. Designed for the AITER+ATOM bi-weekly paired-release pilot. First test target: rocm/atom-dev:atom0.1.4-rc0-aiter0.1.15-rc0.
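
The output convention described above (one MODEL_FLEX_EXTRACT=<value> line per job, compared against a per-model threshold) could be gated roughly as follows. This is a sketch under assumptions: the function name gate_model is hypothetical, the threshold 0.93 is a placeholder rather than any model's real threshold, and the actual workflow may parse the lm_eval output differently.

```shell
# Hypothetical gate (not the workflow's code): read job output on stdin,
# take the last MODEL_FLEX_EXTRACT=<value> line, and succeed only if the
# value is at or above the threshold passed as the first argument.
gate_model() {
    threshold="$1"
    value=$(sed -n 's/^MODEL_FLEX_EXTRACT=//p' | tail -n 1)
    if [ -z "$value" ]; then
        echo "no MODEL_FLEX_EXTRACT line found" >&2
        return 2
    fi
    # awk performs the floating-point comparison; exit status 0 means pass.
    awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v + 0 >= t + 0) }'
}

# Placeholder numbers, for illustration only.
printf 'MODEL_FLEX_EXTRACT=0.95\n' | gate_model 0.93 && echo "gate passed"
```

Returning a distinct status when the line is missing separates "the evaluation never produced a score" from "the score was too low".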

DO MI350X runners do not yet have /models populated for the paired-release matrix (only model_path layout assumed by ATOM repos). Without this fallback 4/5 model jobs fail with HFValidationError on first dispatch.

The job now:
- checks if ${MODEL_PATH} exists in the container
- if yes: uses it as before (-v /models:/models still mounted)
- if no: strips the /models/ prefix and uses the HF repo id, requiring HF_TOKEN to be set (already passed via secrets.HF_TOKEN_TEST)

First-pull cost is the HF download time. Subsequent runs on the same runner benefit from HF cache.

Refs: paired-release-gate pilot post-mortem, task #79 (DO runner debug v0.2).
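
The fallback described in this commit message can be sketched as a small function. This is an illustration, not the code from the workflow: the name resolve_model is hypothetical, and failing early when HF_TOKEN is unset is an assumption about what "requiring HF_TOKEN to be set" should mean.

```shell
# Hypothetical sketch of the described fallback (not the PR's code).
# Prints the local path when it exists, otherwise the HF repo id obtained
# by stripping the /models/ prefix.
resolve_model() {
    model_path="$1"
    if [ -d "$model_path" ]; then
        echo "$model_path"              # local cache hit: use it as before
        return 0
    fi
    if [ -z "${HF_TOKEN:-}" ]; then     # assumption: fail early without a token
        echo "HF_TOKEN must be set for the HuggingFace fallback" >&2
        return 1
    fi
    echo "${model_path#/models/}"       # /models/org/name -> org/name
}
```

With a token set and no local copy, resolve_model /models/org/name prints org/name, which a server launch command could then pass as the model argument.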

github-actions · 2026-06-06T19:01:34Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3578 --add-label <label>

Copilot

Pull request overview

Adds a new manual GitHub Actions workflow to validate paired AITER+ATOM release candidate containers by running a 5-model GSM8K accuracy gate in parallel on dedicated MI350X runners, then summarizing results.

Changes:

Introduces .github/workflows/paired-release-gate.yaml (workflow_dispatch) to run one job per model in a matrix on linux-aiter-do-mi350x-8.
Runs an ATOM OpenAI-compatible server in the container and evaluates lm_eval gsm8k --num_fewshot 3, gating against per-model thresholds.
Adds a HuggingFace fallback path when a local /models cache entry is missing (but currently has a correctness bug in the fallback condition).

+      models:
+        description: 'Comma-separated subset (or "all")'
+        required: false
+        default: 'all'
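
How the workflow turns the models input into per-job decisions is not visible in this excerpt. As a sketch only, a step-level check for a comma-separated subset (or "all") could look like the following; the function name model_selected is hypothetical.

```shell
# Hypothetical step-level filter (not taken from the workflow):
# succeeds when MODEL is listed in SELECTION, or when SELECTION is "all".
model_selected() {
    selection=$(printf '%s' "$1" | tr -d ' ')   # tolerate "dsr1, kimi"
    model="$2"
    [ "$selection" = "all" ] && return 0
    # Wrap both sides in commas so "qwen" does not match "qwen3".
    case ",${selection}," in
        *",${model},"*) return 0 ;;
        *) return 1 ;;
    esac
}

model_selected "dsr1, kimi" kimi && echo "kimi selected"
```

Matching on comma-delimited tokens rather than substrings avoids accidentally selecting a model whose name is a prefix of another.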


+concurrency:
+  group: ${{ github.workflow }}-${{ inputs.container }}
+  cancel-in-progress: false
+
+jobs:


+          if [ -f "/etc/podinfo/gha-render-devices" ]; then
+            DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
+          else
+            DEVICE_FLAG="--device /dev/dri"
+          fi
+          docker run -dt \
+            --device=/dev/kfd $DEVICE_FLAG \
+            --ipc=host --group-add video --shm-size=64G --privileged \
+            --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
+            --ulimit memlock=-1 --ulimit stack=67108864 \
+            --network=host \
+            ${{ matrix.env_vars }} \
+            -e HF_TOKEN="${HF_TOKEN:-${{ secrets.HF_TOKEN_TEST }}}" \
+            -v /models:/models \
+            --name ${INSTANCE} \
+            --entrypoint=sleep ${CONTAINER_IMAGE} infinity


+            EFFECTIVE_MODEL='${MODEL_PATH}'
+            if [ ! -d \"${MODEL_PATH}\" ] && [ ! -e \"${MODEL_PATH}/config.json\" ]; then
+              HF_REPO=\$(echo '${MODEL_PATH}' | sed 's|^/models/||')
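
The review text that accompanied this snippet did not survive, so which bug Copilot meant is not stated here. One possible reading, offered as an assumption: with &&, a directory that exists but has no config.json never triggers the fallback, and the second test is redundant whenever the first is true. The sketch below contrasts the condition as written with a hypothetical stricter variant; both function names are illustrative.

```shell
# Mirrors the condition in the snippet above: fall back only when the
# directory is missing AND config.json is missing.
needs_fallback_as_written() {
    [ ! -d "$1" ] && [ ! -e "$1/config.json" ]
}

# Hypothetical stricter variant: fall back whenever config.json is absent,
# which also covers an existing but empty model directory.
needs_fallback_strict() {
    [ ! -e "$1/config.json" ]
}

# An empty directory shows the difference between the two conditions.
empty_dir=$(mktemp -d)
needs_fallback_as_written "$empty_dir" || echo "as written: no fallback"
needs_fallback_strict "$empty_dir" && echo "strict: fallback"
rm -rf "$empty_dir"
```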


sunway513 added 4 commits June 6, 2026 03:14

ci: fix paired-release-gate workflow yaml (drop unsupported matrix.model in job-level if)
181a698

ci(paired-gate): override container entrypoint to sleep (image baked entrypoint=bash)
3f120fe

sunway513 requested review from a team and Copilot June 6, 2026 19:00

Copilot started reviewing on behalf of sunway513 June 6, 2026 19:00

sunway513 wants to merge 4 commits into main from feat/paired-release-gate
