Skip to content

fix(mergify): auto-route via queue_conditions, not pull_request_rules#393

Merged
coseto6125 merged 3 commits into
mainfrom
fix/mergify-queue-conditions
May 23, 2026
Merged

fix(mergify): auto-route via queue_conditions, not pull_request_rules#393
coseto6125 merged 3 commits into
mainfrom
fix/mergify-queue-conditions

Conversation

@coseto6125
Copy link
Copy Markdown
Owner

@coseto6125 coseto6125 commented May 23, 2026

Symptom

PRs #386 #387 #388 reached CLEAN merge state with all required checks green, but Mergify Merge Queue check stayed at completed/neutral · Merge queue is ready — never embarking. Manual @Mergifyio queue main-docs-only on #386 worked immediately, then DEQUEUED 1m later with 'Checks failed'.

Root cause (3 distinct bugs)

Bug 1: Auto-routing config
pull_request_rules with a queue action is explicit-enqueue in modern Mergify. Auto-routing lives in queue_rules.queue_conditions. Our config had conditions in pull_request_rules → Mergify saw a queueable PR but no routing → "ready" forever.

Bug 2: ci:run label gate
queue_conditions require check-success=Test (all platforms), but Test only fired with the ci:run label. Chicken-egg: PR can't enter queue without Test green, Test won't fire without label, toggling label triggers cancellation races.

Bug 3: cross-pr-conflict placement
Required in merge_conditions, but Mergify evaluates merge_conditions on the speculative-trial commit (fresh SHA = PR diff + main tip). ecp-pr-analyze.yml triggers only on pull_request events, so the trial commit has no cross-pr-conflict status → trial blocks forever → dequeue.

Fix (3 commits)

  1. fix(mergify): migrate routing — pull_request_rules → queue_rules.queue_conditions
  2. fix(ci): drop ci:run gate — Test now fires unconditionally; merge-queue label replaces ci:run as the manual opt-in
  3. fix(mergify): move cross-pr-conflict — merge_conditions → queue_conditions (evaluates on PR HEAD where status actually exists)

Test plan

@github-actions
Copy link
Copy Markdown
Contributor

[]

@github-actions github-actions Bot added the ecp:risk-low ecp signal label May 23, 2026
…ions

Symptom: every PR satisfying the routing rules sat at 'Mergify Merge
Queue / completed / neutral / Merge queue is ready' indefinitely, never
embarking. Manual '@Mergifyio queue main-docs-only' on PR #386 worked
immediately, proving queue_rules themselves are wired up.

Root cause: in modern Mergify, pull_request_rules with a 'queue' action
is an EXPLICIT-enqueue mechanism (for commands / advanced filters). It
does not auto-route based on its 'conditions:' block. Auto-routing lives
in queue_rules.queue_conditions: Mergify evaluates each PR against every
queue_rule's queue_conditions and embarks it into the first match.

Fix: move the routing conditions (base, labels, required-checks) from
each pull_request_rule into the corresponding queue_rule's
queue_conditions, and drop pull_request_rules entirely. The queue_rule
order matters — most-specific (area=test/parser/cli/docs) before the
fallback (no area label) so the catch-all doesn't grab specialized PRs.
ci:run predated Mergify and served as a manual opt-in to fire the Test
matrix on PRs (the GitHub-personal-repo substitute for self-approve,
when no merge queue existed). With Mergify queue_conditions taking over
the merge gate, ci:run became redundant AND actively harmful:

- Mergify queue_conditions require check-success=Test (all platforms),
  but Test would only fire after the label landed → chicken/egg.
- Toggling ci:run between pushes triggered cancellation races with
  in-flight CI workflows (confirmed via the t-mergify-* test PRs).

Remove:
- 'labeled' / 'unlabeled' from pull_request event types (no longer
  needed to trigger workflow runs).
- The contains(...ci:run) check from the test job's if-guard. Test now
  fires unconditionally on PRs; detect-changes still short-circuits
  the matrix to skipped for docs-only changes.
- The PR-skipped FAIL branch in the test-result aggregator — skip only
  happens via [skip ci] now, which is a legitimate opt-out.

The new merge gate is the 'merge-queue' label (applied by author when
ready), evaluated by Mergify queue_conditions alongside the green-check
requirement.
PR #386 entered main-docs-only then immediately dequeued with
'Checks failed' — Mergify's required-conditions box showed
`check-success=ecp/cross-pr-conflict` still unchecked even though
the PR's HEAD commit had that status as success.

Cause: Mergify's speculative trial creates a fresh commit (PR diff +
main tip). merge_conditions are evaluated on THAT commit, not the
PR's HEAD. ecp-pr-analyze.yml triggers only on `pull_request` events,
so it never fires against the trial commit → no cross-pr-conflict
status ever reports → merge_conditions blocked → dequeue.

Move the check from merge_conditions into queue_conditions. The
question 'is it safe to queue this PR?' is evaluated on the PR HEAD,
where the status exists. Once queued, Mergify's own speculative CI
(branch protection's required checks running on the trial commit) is
the merge gate.
@coseto6125 coseto6125 force-pushed the fix/mergify-queue-conditions branch from 817968a to 083bafa Compare May 23, 2026 13:29
@coseto6125 coseto6125 merged commit 0b4fdb2 into main May 23, 2026
8 of 9 checks passed
@coseto6125 coseto6125 deleted the fix/mergify-queue-conditions branch May 23, 2026 13:30
github-actions Bot added a commit that referenced this pull request May 23, 2026
…state (#402)

* diagnostic(ci): instrument all 5 checkout sites to capture flake state

The 'could not read Username for https://github.com' / 'Bad credentials'
failures across today's runs (#388 macos, #393 main-push ubuntu, #401
ubuntu test, etc.) all happen inside actions/checkout@v6.0.2's
'Fetching the repository' step right after a successful 'Setting up
auth' that writes an includeIf-scoped credentials config. Failure looks
consistent with includeIf gitdir path resolution mismatch, but we have
no direct evidence yet — the hypothesis ought to be confirmed before we
swap to a workaround that downgrades security (persist-credentials:
true) or replaces actions/checkout outright.

This PR adds .github/actions/diagnose-checkout-failure (composite
action) and wires every actions/checkout call site in ci.yml to:

  1. continue-on-error: true on the checkout step (id: checkout)
  2. follow-up step that runs only on steps.checkout.conclusion ==
     'failure' and dumps:
     - git version
     - workspace path / readlink chain (catches symlink mismatches)
     - .git/config contents (raw)
     - resolved gitdir from git rev-parse --absolute-git-dir
     - effective config after includeIf evaluation
     - existence of every includeIf-referenced path
     - leftover credentials files in RUNNER_TEMP / /tmp / /github/runner_temp
     - per-component readlink of GITHUB_WORKSPACE (symlink detection)
  3. Re-fails the job at the end so the diagnostic doesn't silently
     convert a real failure into a pass.

Once the next flake fires, the dump will tell us which of:
  - includeIf path doesn't match actual gitdir
  - credentials file got cleaned up between setup and fetch
  - workspace path has a symlink we didn't anticipate
  - the runner had no token in the first place
…actually caused the failure, and we can apply the proportionate fix.

This PR is intentionally diagnostic-only: no behavior change beyond
the failure-path noise. Once root cause is identified and fixed,
revert by removing the diagnostic step / continue-on-error from each
checkout block.

* ci: re-trigger checks for merge commit 53e93a7

The 'Update branch' merge commit was authored by the GitHub UI using
GITHUB_TOKEN, which by GH Actions design does not trigger downstream
pull_request workflows on the resulting SHA. Only the Mergify check
(lazy-evaluated) registered against 53e93a7; CI / CodeQL / ecp PR analyze
all targeted the prior SHA (8ae2fc0).

This empty commit emits a user-authored push so pull_request:synchronize
fires and the full check matrix runs against the actual PR head.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
coseto6125 added a commit that referenced this pull request May 23, 2026
… ref

Root cause for today's recurring 'could not read Username for https://github.com'
flakes (#388 macos, #393 ubuntu, #395 dep-review, #397 main-push,
#401 ubuntu Test, #402 instrumentation): the runner image ships with a
default credential.helper in /etc/gitconfig that errors with ENXIO when
git falls back to it. actions/checkout sets up http.extraheader scoped
to the repo URL, but on certain runner image revisions the auth setup
escapes our isolation and the system helper gets invoked anyway.

Rather than work around the broken helper (which would leave a
permanent shellcheck-style `-c credential.helper=` debt on every
git command), we eliminate the failure surface entirely: the only steps
that do post-checkout fetches all want the same thing — main's tip
SHA — and GitHub already provides that in the pull_request event
payload (`github.event.pull_request.base.sha`).

# ci.yml — `Detect code changes` job

Was:
  git fetch --no-tags origin "$BASE_REF"
  diff_range="origin/$BASE_REF...HEAD"

Now:
  # Event payload exposes base.sha for free; checkout used default
  # ref (refs/pull/N/merge) so both sides are in local object DB.
  diff_range="$BASE_SHA...HEAD"

Three-dot range still gives merge-base..HEAD semantics — equivalent to
the old behavior, no network needed.

# ecp-pr-analyze.yml — drop `Fetch base ref` + recompute branch point locally

Was:
  - uses: actions/checkout@v6.0.2
    with:
      ref: ${{ pull_request.head.sha }}     # only PR head ancestors fetched
  - name: Fetch base ref
    run: git fetch origin "$BASE_REF:..."  # network — triggers ENXIO flake
  ...
  BASE=$(git merge-base "origin/$BASE_REF" HEAD)

Now:
  - uses: actions/checkout@v6.0.2
    with:
      fetch-depth: 0
      # No `ref:` override. Default refs/pull/N/merge brings both PR
      # head AND base history into local object DB.
  - name: Compute branch point + switch HEAD to PR head
    run: |
      PR_HEAD=$(git rev-parse HEAD^1)       # merge ref's parent 1
      BASE_TIP=$(git rev-parse HEAD^2)      # merge ref's parent 2
      BRANCH_POINT=$(git merge-base "$PR_HEAD" "$BASE_TIP")
      git checkout "$PR_HEAD"

Branch point is the SAME value the old `git merge-base origin/<base>`
would produce — but derived purely from local objects (the merge ref's
two parents) instead of a network fetch.

# Edge cases

- PR with merge conflicts: GitHub doesn't compute refs/pull/N/merge,
  checkout fails. This is correct — conflicted PRs can't merge, so
  ecp impact analysis would be meaningless. Author resolves conflict,
  ref recomputed, next run works.
- Push to main / merge_group / workflow_dispatch: unchanged code path
  (already used BEFORE_SHA / blanket 'code=true', no fetch).

# Result

- One entire class of CI flake eliminated: no post-checkout git fetch
  means no credential-helper invocation means no ENXIO.
- No upstream-bug workaround comment debt.
- Slightly faster CI (one fewer network round-trip per PR job).
- Closes the path that diagnostic instrumentation in PR #402 was
  trying to capture; PR #402 can be closed once this lands.
coseto6125 added a commit that referenced this pull request May 23, 2026
… ref (#404)

Root cause for today's recurring 'could not read Username for https://github.com'
flakes (#388 macos, #393 ubuntu, #395 dep-review, #397 main-push,
#401 ubuntu Test, #402 instrumentation): the runner image ships with a
default credential.helper in /etc/gitconfig that errors with ENXIO when
git falls back to it. actions/checkout sets up http.extraheader scoped
to the repo URL, but on certain runner image revisions the auth setup
escapes our isolation and the system helper gets invoked anyway.

Rather than work around the broken helper (which would leave a
permanent shellcheck-style `-c credential.helper=` debt on every
git command), we eliminate the failure surface entirely: the only steps
that do post-checkout fetches all want the same thing — main's tip
SHA — and GitHub already provides that in the pull_request event
payload (`github.event.pull_request.base.sha`).

# ci.yml — `Detect code changes` job

Was:
  git fetch --no-tags origin "$BASE_REF"
  diff_range="origin/$BASE_REF...HEAD"

Now:
  # Event payload exposes base.sha for free; checkout used default
  # ref (refs/pull/N/merge) so both sides are in local object DB.
  diff_range="$BASE_SHA...HEAD"

Three-dot range still gives merge-base..HEAD semantics — equivalent to
the old behavior, no network needed.

# ecp-pr-analyze.yml — drop `Fetch base ref` + recompute branch point locally

Was:
  - uses: actions/checkout@v6.0.2
    with:
      ref: ${{ pull_request.head.sha }}     # only PR head ancestors fetched
  - name: Fetch base ref
    run: git fetch origin "$BASE_REF:..."  # network — triggers ENXIO flake
  ...
  BASE=$(git merge-base "origin/$BASE_REF" HEAD)

Now:
  - uses: actions/checkout@v6.0.2
    with:
      fetch-depth: 0
      # No `ref:` override. Default refs/pull/N/merge brings both PR
      # head AND base history into local object DB.
  - name: Compute branch point + switch HEAD to PR head
    run: |
      PR_HEAD=$(git rev-parse HEAD^1)       # merge ref's parent 1
      BASE_TIP=$(git rev-parse HEAD^2)      # merge ref's parent 2
      BRANCH_POINT=$(git merge-base "$PR_HEAD" "$BASE_TIP")
      git checkout "$PR_HEAD"

Branch point is the SAME value the old `git merge-base origin/<base>`
would produce — but derived purely from local objects (the merge ref's
two parents) instead of a network fetch.

# Edge cases

- PR with merge conflicts: GitHub doesn't compute refs/pull/N/merge,
  checkout fails. This is correct — conflicted PRs can't merge, so
  ecp impact analysis would be meaningless. Author resolves conflict,
  ref recomputed, next run works.
- Push to main / merge_group / workflow_dispatch: unchanged code path
  (already used BEFORE_SHA / blanket 'code=true', no fetch).

# Result

- One entire class of CI flake eliminated: no post-checkout git fetch
  means no credential-helper invocation means no ENXIO.
- No upstream-bug workaround comment debt.
- Slightly faster CI (one fewer network round-trip per PR job).
- Closes the path that diagnostic instrumentation in PR #402 was
  trying to capture; PR #402 can be closed once this lands.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ecp:risk-low ecp signal

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant