ci: add paired-release validation gate workflow (AITER+ATOM matrix) #3578

Open
sunway513 wants to merge 4 commits into main from feat/paired-release-gate

@sunway513
Collaborator

Summary

Adds .github/workflows/paired-release-gate.yaml — a workflow_dispatch validation gate for the AITER+ATOM bi-weekly paired release pilot (cadence proposal, pilot underway with AITER v0.1.15-rc0 + ATOM v0.1.4-rc0).

Each dispatch:

  • pulls a paired rocm/atom-dev:atomX.Y-rcN-aiterX.Y-rcN container
  • spawns one job per model (DSR1, MiniMax, Qwen3, GLM-5, Kimi) in parallel on dedicated linux-aiter-do-mi350x-8 runners (DO MI350X pool, 6 nodes x 8 GPUs)
  • launches ATOM server + runs lm_eval gsm8k --num_fewshot 3
  • gates each model against its upstream-canonical GSM8K threshold
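
The gating step in the last bullet comes down to a floating-point comparison between a model's measured GSM8K score and its threshold. A minimal sketch, assuming a hypothetical helper named `gate_score` (it is not taken from the workflow) and illustrative numbers rather than the real per-model thresholds:

```shell
#!/bin/sh
# Hypothetical helper, not part of the workflow in this PR.
# Succeeds (exit 0) when the score meets or exceeds the threshold.
# awk does the comparison because POSIX shell arithmetic is integer-only.
gate_score() {
  awk -v s="$1" -v t="$2" 'BEGIN { exit !(s + 0 >= t + 0) }'
}

# Illustrative values only; the real thresholds live in the workflow.
if gate_score "0.95" "0.93"; then
  echo "PASS"
else
  echo "FAIL"
fi
```

Returning an exit status instead of printing a verdict lets a workflow step fail the job directly when the gate is not met.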

Wall time: ~20 min for 5-model gate vs ~50-60 min serial on a single node.

Container contract

Image must ship: ATOM source + AITER wheel + matching triton / flydsl. Workflow does NOT install AITER from source.

Verified container for first run: rocm/atom-dev:atom0.1.4-rc0-aiter0.1.15-rc0 (pushed to Docker Hub 2026-06-05).

Commit history on this branch

  1. 55db09a9: initial workflow scaffold
  2. 181a6980: YAML fix (drop unsupported matrix.model in job-level if)
  3. 3f120feb: override container entrypoint to sleep (the image bakes in entrypoint=bash)
  4. 728fbb65: HuggingFace fallback when the /models cache is missing on DO runners (this addresses the most common first-dispatch failure mode)

Known limitations (tracked for v0.2)

First end-to-end dispatch (run 27068415810, before commit 728fbb65) surfaced three categories of failure on DO runners:

  1. Missing /models cache (4/5 jobs) — DO runners did not have /models/{repo}/{model} populated. Fixed in this PR via HF fallback (commit 728fbb65). Requires secrets.HF_TOKEN_TEST to be configured on the repo.
  2. Kimi hipErrorIllegalState on container start — needs runner-side GPU init investigation. Tracked separately.
  3. DSR1 RCCL bootstrap timeout (store->get('0') 600s wait) — multi-process bootstrap config on DO runners. Tracked separately.

(2) and (3) require ATOM team + DO runner ops coordination and are not in scope for this PR.

Test plan

  • YAML syntax validated (python -c "import yaml; yaml.safe_load(...)")
  • Workflow dispatch UI smoke test (a no-op mode such as models=skip is not implemented, so a dispatch on a known-good container is expected to be the first real validation)
  • Post-merge: dispatch on rocm/atom-dev:atom0.1.4-rc0-aiter0.1.15-rc0 once HF_TOKEN_TEST secret is in place, confirm at least one model passes end-to-end

Why merge now

The scaffold + HF fallback is enough to be the starting point for the v0.1.5 / v0.1.16 paired-release cycle (~2026-06-22). Iterations (Kimi GPU init, DSR1 RCCL) land as follow-up PRs. Keeping the scaffold on main lets ATOM team + RE team converge on the same gate definition.

cc @valarLip @plehnert @ROCmRichardLi

sunway513 added 4 commits June 6, 2026 03:14
Multi-model GSM8K gate that dispatches against a paired AITER+ATOM
container image. Runs each of {dsr1, minimax, qwen3, glm5, kimi} on its
own DO MI350X 8-GPU runner in parallel. Output convention:
MODEL_FLEX_EXTRACT=<value> for each job, gated on per-model threshold.
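
A downstream step could recover the score from a job log with a one-line filter. This is a sketch under assumptions: the helper name `extract_flex_score` is hypothetical, and the literal key `MODEL_FLEX_EXTRACT` at the start of a line is read off the convention above rather than from the workflow source.

```shell
#!/bin/sh
# Hypothetical helper: read a job log on stdin and print the value of the
# last line that starts with MODEL_FLEX_EXTRACT= (assumed key format).
extract_flex_score() {
  sed -n 's/^MODEL_FLEX_EXTRACT=//p' | tail -n 1
}

# Example with a made-up log; 0.9431 is an illustrative number, not a result.
printf 'server ready\nMODEL_FLEX_EXTRACT=0.9431\n' | extract_flex_score
```

Taking the last match means a rerun inside the same job would override an earlier value; if the key never appears, the helper prints nothing and the caller can treat that as a failure.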

Designed for the AITER+ATOM bi-weekly paired-release pilot.
First test target: rocm/atom-dev:atom0.1.4-rc0-aiter0.1.15-rc0.

DO MI350X runners do not yet have /models populated for the paired-release
matrix (only model_path layout assumed by ATOM repos). Without this fallback,
4/5 model jobs fail with HFValidationError on first dispatch.

The job now:
  - checks if ${MODEL_PATH} exists in the container
  - if yes: uses it as before (-v /models:/models still mounted)
  - if no: strips the /models/ prefix and uses the HF repo id, requiring
    HF_TOKEN to be set (already passed via secrets.HF_TOKEN_TEST)

First-pull cost is the HF download time. Subsequent runs on the same runner
benefit from HF cache.
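
The fallback described above can be sketched as a small shell function. The name `resolve_model` is hypothetical, and treating a directory without a `config.json` as "missing" is an assumption of this sketch, not something the commit message states:

```shell
#!/bin/sh
# Hypothetical helper, not the workflow's actual code.
# Prints the local path when it looks like a populated model directory,
# otherwise prints the HuggingFace repo id derived from the path.
resolve_model() {
  local model_path="$1"
  # Assumption: a usable local copy is a directory containing config.json.
  if [ -d "$model_path" ] && [ -e "$model_path/config.json" ]; then
    printf '%s\n' "$model_path"
    return 0
  fi
  # The fallback downloads from HuggingFace, so a token is required.
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN must be set for the HuggingFace fallback" >&2
    return 1
  fi
  # Strip the leading /models/ prefix to obtain the repo id.
  printf '%s\n' "${model_path#/models/}"
}
```

For example, with `HF_TOKEN` set and no local copy present, `resolve_model /models/example-org/example-model` would print `example-org/example-model` (a made-up repo id).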

Refs: paired-release-gate pilot post-mortem, task #79 (DO runner debug v0.2).
@github-actions
Contributor

github-actions Bot commented Jun 6, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3578 --add-label <label>

Contributor

Copilot AI left a comment

Pull request overview

Adds a new manual GitHub Actions workflow to validate paired AITER+ATOM release candidate containers by running a 5-model GSM8K accuracy gate in parallel on dedicated MI350X runners, then summarizing results.

Changes:

  • Introduces .github/workflows/paired-release-gate.yaml (workflow_dispatch) to run one job per model in a matrix on linux-aiter-do-mi350x-8.
  • Runs an ATOM OpenAI-compatible server in the container and evaluates lm_eval gsm8k --num_fewshot 3, gating against per-model thresholds.
  • Adds a HuggingFace fallback path when a local /models cache entry is missing (but currently has a correctness bug in the fallback condition).

Comment on lines +22 to +25
models:
  description: 'Comma-separated subset (or "all")'
  required: false
  default: 'all'
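
For reference, one way a per-model step could interpret this input is sketched below. The helper name `model_selected` is hypothetical, and both the exact-name matching and the tolerance for spaces after commas are assumptions, not behavior confirmed by the workflow:

```shell
#!/bin/sh
# Hypothetical helper: succeed when MODEL is selected by the `models` input.
# "all" selects every model; otherwise the input is a comma-separated list.
model_selected() {
  local model="$1" selection
  # Drop spaces so "dsr1, kimi" and "dsr1,kimi" behave the same.
  selection=$(printf '%s' "$2" | tr -d ' ')
  [ "$selection" = "all" ] && return 0
  # Wrap both sides in commas so "kim" does not match "kimi".
  case ",$selection," in
    *",$model,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A step could then guard its body with `model_selected "kimi" "$MODELS_INPUT" || exit 0`, where `MODELS_INPUT` is a placeholder for however the input is passed into the step.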
Comment on lines +27 to +31
concurrency:
  group: ${{ github.workflow }}-${{ inputs.container }}
  cancel-in-progress: false

jobs:
Comment on lines +88 to +103
if [ -f "/etc/podinfo/gha-render-devices" ]; then
  DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
else
  DEVICE_FLAG="--device /dev/dri"
fi
docker run -dt \
  --device=/dev/kfd $DEVICE_FLAG \
  --ipc=host --group-add video --shm-size=64G --privileged \
  --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --network=host \
  ${{ matrix.env_vars }} \
  -e HF_TOKEN="${HF_TOKEN:-${{ secrets.HF_TOKEN_TEST }}}" \
  -v /models:/models \
  --name ${INSTANCE} \
  --entrypoint=sleep ${CONTAINER_IMAGE} infinity
Comment on lines +121 to +123
EFFECTIVE_MODEL='${MODEL_PATH}'
if [ ! -d \"${MODEL_PATH}\" ] && [ ! -e \"${MODEL_PATH}/config.json\" ]; then
  HF_REPO=\$(echo '${MODEL_PATH}' | sed 's|^/models/||')