Skip to content

[AAASM-2328] 🐛 (ci): Retry gh release download on cross-repo race (release not found)#65

Merged
Chisanan232 merged 1 commit into
masterfrom
v0.0.1/AAASM-2328/fix/release_node_retry
Jun 1, 2026
Merged

[AAASM-2328] 🐛 (ci): Retry gh release download on cross-repo race (release not found)#65
Chisanan232 merged 1 commit into
masterfrom
v0.0.1/AAASM-2328/fix/release_node_retry

Conversation

@Chisanan232
Copy link
Copy Markdown
Contributor

@Chisanan232 Chisanan232 commented Jun 1, 2026

Description

release-node.yml fires on the same tag-push event as agent-assembly's Release workflow, in parallel. agent-assembly takes ~10 min to build + publish the binaries release-node depends on, so the first download attempt consistently fails:

gh release download "v0.0.1-alpha.3" --repo AI-agent-assembly/agent-assembly --pattern "aasm-*.tar.gz" --dir bin-staging
release not found

Observed on alpha-2 and alpha-3 dry-runs. Hidden in alpha-1 by an earlier-failing pnpm-lock step (AAASM-2098).

Fix

Wrap gh release download in a retry loop: up to 20 attempts × 60s = 20 min ceiling. Distinguish 'release not found' (race → retry) from other errors (auth, rate-limit → fail-fast).

Alternatives considered

Option Why not
workflow_run cross-repo trigger Only fires for same-repo workflows; not feasible cross-repo
repository_dispatch from agent-assembly Cleaner architecturally but needs PAT + extra wiring. Defer until polling cost is a real issue.
Manual workflow_dispatch only Loses automation entirely

Retry-with-backoff is the smallest change that solves the race.

Local verification

  • Simulated retry on race: 3-attempt loop with first 2 returning 'release not found' → third succeeds → loop exits cleanly
  • Fail-fast on non-race error: HTTP 403 rate limit → grep miss on 'release not found' → exit 1 without retry

Related

— Claude Code (Opus 4.7, 1M context)

release-node.yml fires on the same tag-push event as agent-assembly's
Release workflow, in parallel. agent-assembly takes ~10 min to build
+ publish the binaries that release-node depends on, so the first
gh release download attempt consistently fails with:

  release not found

(observed on v0.0.1-alpha.2 and v0.0.1-alpha.3 dry-runs; would have
failed every prior alpha too if not blocked at earlier steps).

Fix: wrap gh release download in a retry loop. Up to 20 attempts
with 60-second sleeps = 20 min ceiling waiting for the Release.
Distinguish 'release not found' (race condition → retry) from
other errors (auth failure, rate limit, etc. → fail-fast).

Verified locally:

  * retry-on-race: 3-attempt simulation with first 2 returning
    'release not found' → third succeeds → loop exits cleanly
  * fail-fast on other error: 'HTTP 403 rate limit' → grep miss
    on 'release not found' → exit 1 without retry

Trade-off vs alternatives considered:

  - workflow_run cross-repo trigger: NOT possible; workflow_run
    only fires for same-repo workflows
  - repository_dispatch from agent-assembly: cleaner architecturally
    but needs PAT + extra cross-repo wiring. Defer until needed.
  - Manual workflow_dispatch only: loses automation entirely.

Retry-with-backoff is the smallest change that solves the race.
Migrate to repository_dispatch later if polling cost becomes real.

Tracked: AAASM-2328
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 1, 2026

@Chisanan232
Copy link
Copy Markdown
Contributor Author

Chisanan232 commented Jun 1, 2026

Claude Code review — AAASM-2328

CI state

25/25 SUCCESSmergeable=MERGEABLE, mergeStateStatus=CLEAN. Full node-sdk matrix green. Note: the retry loop itself only fires on tag-push (when release-node actually runs), not on PR — so PR CI doesn't directly exercise the retry behavior. Verification is local simulation + the next alpha-tag-push.

Scope vs. acceptance criteria

AC Verified Status
gh release download wrapped in 20×60s retry loop (~20 min ceiling) confirmed
Distinguishes 'release not found' (race → retry) from other errors (auth, rate-limit → fail-fast) grep-based discrimination confirmed locally
Clear error message at retry-exhaustion: "Release v${RELEASE_TAG} never appeared after $((MAX_ATTEMPTS * 60))s — agent-assembly Release pipeline likely failed" confirmed
actionlint clean yes
set -euo pipefail semantics preserved (the if cmd; then construct correctly suspends set -e for the conditional) verified by inspection

Design choices worth recording

Why retry-with-backoff over alternatives

  • workflow_run cross-repo: Not feasible — only fires for same-repo workflows.
  • repository_dispatch from agent-assembly: Cleaner architecturally but needs PAT + cross-repo wiring; defer until polling cost becomes a real issue.
  • Drop push: tags trigger, manual workflow_dispatch only: Loses automation entirely.

Retry-with-backoff is the smallest change that solves the actual problem. If 20 min polling per release becomes meaningful CI-minutes cost (~$0.04 at ubuntu-latest rates), revisit repository_dispatch.

Why 20 attempts × 60s

agent-assembly's Release pipeline observed durations:

  • alpha-1: ~12 min total (build × 4 then publish)
  • alpha-2: ~11 min
  • alpha-3: ~10 min

20 min ceiling gives ~2× headroom for outliers. Could tighten to 15 if observed durations stay stable.

Subtle correctness detail

The if gh release download ... 2>&1 | tee /tmp/gh-rd.log; then form does two things:

  1. Captures both stdout and stderr to the log file (via 2>&1 | tee)
  2. Uses the exit code of the LAST command in the pipe (tee, not gh) — but tee only fails on file-write errors, so it almost always returns 0

The exit code we actually want is gh release download's. The pipe-to-tee masks that. However, the if then-branch correctly handles the success path (regardless of tee's exit), and the else-branch path is reached via the grep on the log file — which is the actual signal we want. So the logic is correct, but the pipe construct is more brittle than a gh release download ... > /tmp/gh-rd.log 2>&1 would be.

Not blocking — the current form works correctly under the observed failure modes. But a small follow-up to use pipefail-aware redirection would be cleaner.

Verdict

Ready for human approval and merge. The fix is correct, locally verified, design choices documented. The pipe-construct quibble in the section above is style, not correctness.

— Claude Code (Opus 4.7, 1M context)

@Chisanan232 Chisanan232 merged commit fb0a7e6 into master Jun 1, 2026
25 checks passed
@Chisanan232 Chisanan232 deleted the v0.0.1/AAASM-2328/fix/release_node_retry branch June 1, 2026 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant