Skip to content

[CI test - DO NOT MERGE] Intentionally break two lit tests (v2)#192

Closed
lamb-j wants to merge 1 commit into
amd-stagingfrom
users/lambj/spirv-ci-test-new-failures-v2
Closed

[CI test - DO NOT MERGE] Intentionally break two lit tests (v2)#192
lamb-j wants to merge 1 commit into
amd-stagingfrom
users/lambj/spirv-ci-test-new-failures-v2

Conversation

@lamb-j

@lamb-j lamb-j commented May 8, 2026

Copy link
Copy Markdown
Collaborator

Modifies CHECK directives in two currently-passing transcoding tests with sentinel strings that won't appear in actual translator output:

  • test/transcoding/fmax.ll — broke CHECK-SPIRV line
  • test/transcoding/CommonAddrspaceGlobalVar.ll — broke CHECK-LLVM line

@github-actions

github-actions Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

🔴 2 new translator lit failures introduced by this PR (non-blocking; see run).

🔴 New failures (2) — likely caused by this PR
FAIL: LLVM_SPIRV :: transcoding/CommonAddrspaceGlobalVar.ll
FAIL: LLVM_SPIRV :: transcoding/fmax.ll
🟢 Fixed by this PR (0) — failing on baseline, passing here

(none)

⚠️ Pre-existing on `amd-staging` (21)
FAIL: LLVM_SPIRV :: constant/local-float-point-constants.ll
FAIL: LLVM_SPIRV :: extensions/EXT/SPV_EXT_float8/conversions_matrix.ll
FAIL: LLVM_SPIRV :: extensions/EXT/SPV_EXT_float8/conversions_scalar_vector.ll
FAIL: LLVM_SPIRV :: extensions/EXT/SPV_EXT_shader_atomic_float_/atomicrmw_fsub_half.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_float4/conversions_packed.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_float4/conversions_scalar_vector.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_fp_conversions/spv_intel_fp_conversions.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_int4/conversions_packed.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_sigmoid/sigmoid_f16.ll
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_bfloat16/cooperative_matrix_bfloat16.ll
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_cooperative_matrix/conversion_instructions.ll
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_subgroup_rotate/SPV_KHR_subgroup_rotate.cl
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_uniform_group_instructions/group-instructions.ll
FAIL: LLVM_SPIRV :: transcoding/float16.ll
FAIL: LLVM_SPIRV :: transcoding/image_signedness_spv_ir.ll
FAIL: LLVM_SPIRV :: transcoding/OpImageSampleExplicitLod_arg.cl
FAIL: LLVM_SPIRV :: transcoding/spec_const.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_clustered_reduce.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_non_uniform_arithmetic.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_shuffle.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_shuffle_relative.ll

lamb-j added a commit that referenced this pull request May 10, 2026
The PR-head and baseline lit steps stay continue-on-error so pre-
existing amd-staging breakage doesn't block. The partition comment
already surfaces 🔴 New / 🟢 Fixed / ⚠️ Pre-existing buckets.

Add a final gate step that exits non-zero when newFails > 0 (PR head
FAILs not present in baseline). Real PR-introduced regressions now
turn the check red instead of just dropping into a comment.

Degrades gracefully — if either capture file is missing (e.g., one of
the lit runs failed to produce output) the gate emits a warning and
passes rather than blocking on incomplete data.

Now that the partition logic has been validated end-to-end on PRs
#190/#192 (🔴 New) and #193 (🟢 Fixed), this is a safe step up from
informational to blocking.
lamb-j added a commit that referenced this pull request May 10, 2026
* [CI] Split spirv-ci into Build + parallel Test jobs

Today the SPIRV CI is one mega-job (Linux Build & Test) that builds
the LLVM/translator/Comgr stack and runs all lit suites sequentially.
As we add more test categories (rocm-examples, hip-tests per the SPIRV
Automated Testing Status confluence page), one mega-check is too coarse
— a failing suite hides the others, can't selectively re-run, and the
required-check is all-or-nothing.

Split into 4 jobs:

  Linux Build (required)
   ├─► Linux Test - SPIRV translator lit  (informational, baseline-diff)
   ├─► Linux Test - LLVM SPIRV codegen    (informational)
   └─► Linux Test - Comgr                  (informational)

Build uploads `build/`, `build-comgr/`, `build-device-libs/` (after
strip --strip-unneeded) as a single GHA artifact. Test jobs do a fresh
checkout of source trees + download the artifact. Source isn't shipped
in the artifact (cleaner: artifact = build outputs, checkout = source).

Translator-lit baseline-diff (PR head vs amd-staging) preserved as-is —
runs in the translator-lit test job: download artifact, run lit at PR
head, swap translator src to amd-staging tip, cmake-reconfigure +
ninja-incremental + lit-rerun, post sticky comment with new/fixed/
pre-existing partition. Translator checkout in this job uses
fetch-depth: 0 so the baseline `git checkout amd-staging` works.

Tests stay informational (per design discussion): promote to required
individually as each suite stabilizes. Only Linux Build is required.

Pragmatic deviations from TheRock conventions:
  - GHA artifacts (not S3) for transport — no OIDC/IAM access here yet.
    Will swap to S3 patterns (post_build_upload.py / fetch_artifacts.py)
    when this workflow moves into TheRock. Job graph + naming match
    TheRock so the swap is mechanical.
  - "Linux" prefix on job names — TheRock uses workflow-file-as-platform
    convention, but we keep the prefix consistent with the prior rename
    and our future Windows expansion plans.

ACTION REQUIRED on merge: update the amd-staging-psdb ruleset's
required-check context from "Linux Build & Test" → "Linux Build".
Without it the old rule will dangle (no job named "Linux Build & Test"
exists anymore) and PRs would block on a permanently-pending placeholder.

* [CI] Refactor into 2-file workflow_call chain (TheRock conventions)

Mirrors TheRock's Multi-Arch CI shape: a top-level dispatcher with
platform-variant jobs that call reusable per-platform workflows. The
Linux variant is byte-isolated so adding Windows later is just a new
file + dispatcher entry.

Files:
  - spirv-ci.yml — top-level dispatcher (~25 lines), byte-identical
    across both this repo and ROCm/llvm-project. Triggers on
    pull_request + workflow_dispatch. Sole job (linux_release, name
    Linux::release) calls spirv-ci-linux.yml.
  - spirv-ci-linux.yml — workflow_call. Holds the build job + 3 test
    jobs (factored from the prior single spirv-ci.yml).

Rendered check_run names in the PR rollup (workflow_call composes the
slash hierarchy):
  - SPIRV Compiler CI / Linux::release / Build
  - SPIRV Compiler CI / Linux::release / Test SPIRV translator lit
  - SPIRV Compiler CI / Linux::release / Test LLVM SPIRV codegen
  - SPIRV Compiler CI / Linux::release / Test Comgr

Convention alignment with TheRock (verified against
.github/workflows/multi_arch_ci.yml, multi_arch_ci_linux.yml and the
rest of the workflow set):
  - Top-level workflow name Title Case, no branch suffix
  - Reusable workflow has its own descriptive name
  - snake_case job IDs, separate display name field
  - No literal slashes in job names — slashes come from workflow_call
  - concurrency: only on dispatcher, not on workflow_call
  - secrets: inherit at the dispatcher → variant call
  - permissions read-only at workflow level; pull-requests: write
    escalated only on the translator-lit job that posts the comment
  - workflow_dispatch alongside pull_request — Multi-Arch's check
    names don't get the "(pull_request)" suffix even with both
    triggers, so the disambiguation bug we hit before is sidestepped
    by the workflow_call structure
  - Container image pinned with @sha256:

ACTION REQUIRED on merge: update the amd-staging-psdb ruleset's
required-check context from "Linux Build" → "Linux::release / Build".
Without it the dangling rule blocks PRs on a permanently-pending
placeholder.

* [CI] Grant pull-requests: write at dispatcher level (fix startup_failure)

The previous commit moved pull-requests: write to a job-level permission
on the test_translator_lit job inside spirv-ci-linux.yml. Per GHA rules,
a called workflow's GITHUB_TOKEN permissions are capped by the caller's
permissions; if the caller's workflow-level grant is `contents: read`
only, the called workflow can't add pull-requests: write — validation
fails at startup, no jobs run.

Symptom: PR #194 ran with conclusion=startup_failure, no check_runs
created, empty PR rollup.

Fix: grant pull-requests: write at the dispatcher's workflow-level
permissions. The called workflow's job-level grant on
test_translator_lit still narrows the actual usage to that single job;
this just lifts the caller-side cap.

Multi-Arch CI gets away with `contents: read` only because none of its
jobs post comments. Ours does, so this grant is required.

* [CI] Tar build trees for artifact transport (preserve modes + .git)

The previous commit's first run hit two distinct failures rooted in
actions/upload-artifact@v4 defaults:

  1. Comgr test job: "/bin/sh: clang-23: Permission denied" — v4
     strips executable bits on upload, so binaries come back
     non-executable.

  2. Codegen test job: cmake reconfigure (triggered when ninja sees
     the freshly-checked-out source as newer than the build tree)
     failed inside FetchContent's SPIRV-Headers update step with
     "fatal: not a git repository: '.git'" — v4 also excludes
     hidden files (the .git dir under build/_deps/) by default.

Translator-lit job appeared green but actually hit the same first-lit
breakage; continue-on-error: true masked it.

Fix: tar the build trees ourselves before upload, untar after
download. Tar preserves both file modes and hidden files in one shot,
which is simpler than chmodding +x and toggling include-hidden-files
separately.

* [CI] Test-job wall-time fixes: tar -xmf + shallower translator clone

Two wall-time bugs found while diagnosing PR #194's slow codegen test:

1. Untar with `tar -xf` restored mtimes from when the build job
   produced them. Test-job source checkouts run just before, so
   freshly-checked-out source files appeared newer than (older) build
   outputs from the tar, and ninja cascade-rebuilt instead of running
   the requested test target. Switch to `tar -xmf` (skip mtime
   restore) so build outputs are newer. Observed ~5-10 min wasted on
   codegen test job.

2. Translator-lit job's translator checkout used `fetch-depth: 0` (full
   history) only to enable the later `git checkout amd-staging` for the
   baseline swap. But `git fetch --depth=1 origin amd-staging` followed
   by `git checkout FETCH_HEAD` works on a shallow clone too. Switch to
   `fetch-depth: 1`. ~30-60s save per run.

Drive-by: trim verbose comments on the tar/untar steps.

* [CI] Gate translator-lit job on new failures

The PR-head and baseline lit steps stay continue-on-error so pre-
existing amd-staging breakage doesn't block. The partition comment
already surfaces 🔴 New / 🟢 Fixed / ⚠️ Pre-existing buckets.

Add a final gate step that exits non-zero when newFails > 0 (PR head
FAILs not present in baseline). Real PR-introduced regressions now
turn the check red instead of just dropping into a comment.

Degrades gracefully — if either capture file is missing (e.g., one of
the lit runs failed to produce output) the gate emits a warning and
passes rather than blocking on incomplete data.

Now that the partition logic has been validated end-to-end on PRs
#190/#192 (🔴 New) and #193 (🟢 Fixed), this is a safe step up from
informational to blocking.
Re-creation of #190 (closed) on top of current amd-staging which
includes the "Linux Build & Test" job rename (#191) and required-check
ruleset update. Verifies end-to-end:

  - the renamed job emits "Linux Build & Test" check_run name
  - the updated required-check rule resolves to satisfied
  - the partition comment correctly classifies these test failures
    as 🔴 New failures

Modifies CHECK directives in two passing transcoding tests with
sentinel strings that won't appear in the actual output:

  - test/transcoding/fmax.ll
  - test/transcoding/CommonAddrspaceGlobalVar.ll

Close without merge once verified.
@lamb-j lamb-j force-pushed the users/lambj/spirv-ci-test-new-failures-v2 branch from 176b27b to 430d6d3 Compare May 10, 2026 21:37
@lamb-j lamb-j closed this May 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant