Skip to content

fix(settlement): route Tier-4 read probes through app auth#8405

Open
an0mium wants to merge 6 commits into
mainfrom
codex/github-app-read-routing-20260614
Open

fix(settlement): route Tier-4 read probes through app auth#8405
an0mium wants to merge 6 commits into
mainfrom
codex/github-app-read-routing-20260614

Conversation

@an0mium

@an0mium an0mium commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Route scripts/settle_tier4_pr.py read-only GitHub probes through the existing GitHub App auth path so --check does not burn the active operator PAT GraphQL bucket.
  • Preserve human/operator identity boundaries: gh api user and all settlement/merge/protection writes force user auth instead of App auth.
  • Add scripts/github_token_doctor.py to report masked core/GraphQL capacity for an0mium, scarmani, and the GitHub App token without printing token values.
  • Add focused regressions for App-auth reads, user-auth writes, CWD preservation, and token-doctor masking/capacity behavior.

Live proof

Before this change, settle_tier4_pr.py --check for PR #8392 failed when the active an0mium GraphQL bucket was exhausted. With the helper using App-backed reads, the same command reaches settlement diagnostics instead of transport failure.

Current token doctor showed separate buckets:

  • an0mium: GraphQL capacity was low/degraded during this run.
  • scarmani: GraphQL capacity available.
  • GitHub App: separate App bucket available (core limit 15000, graphql limit 5000).

Validation

  • python3 -m pytest tests/scripts/test_settle_tier4_pr.py tests/swarm/test_github_app_auth.py tests/scripts/test_github_token_doctor.py -> 70 passed
  • pre-commit run --files scripts/settle_tier4_pr.py aragora/swarm/github_app_auth.py scripts/github_token_doctor.py tests/scripts/test_settle_tier4_pr.py tests/swarm/test_github_app_auth.py tests/scripts/test_github_token_doctor.py -> passed
  • bash scripts/automation_pr_preflight.sh origin/main HEAD -> passed
  • Pre-push hooks -> passed, including mypy on changed Aragora files

Governance

No settlement artifacts posted, no PR marked ready, no merge attempted, no branch protection/workflow changes.

scarmani and others added 2 commits June 13, 2026 23:54
Route Tier-4 settlement helper read-only GitHub probes through the existing GitHub App auth path while preserving user auth for identity checks and write operations. Add a masked token-capacity doctor for diagnosing repeated GraphQL exhaustion across operator and App identities.

Co-authored-by: codex[bot] <codex[bot]@users.noreply.github.com>
Fix the Tier-4 helper text-command stdin path and make the GitHub token doctor report rate-limit probe timeouts as unavailable sources instead of crashing. Add focused regressions for both review findings.

Co-authored-by: codex[bot] <codex[bot]@users.noreply.github.com>
@an0mium

an0mium commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator Author

Grok independent model review

Reviewer: grok (xai) — independent adversarial model review via the Aragora Grok reviewer, grounded on the exact PR head.
Head: daa394a (daa394a), committed 2026-06-14T05:20:29Z.
PR: #8405.
Model family: grok

Verdict: PASS

  • No blocking issues found. cwd propagation, auth preference toggles (prefer_app=False for identity), and fallback paths are correct with matching tests.
  • [P3] github_token_doctor.py: token = "" in finally blocks is a no-op on the local (not a secure erase); harmless for diagnostics but could be removed.
  • [P3] settle_tier4_pr.py: duplicate env construction in _run_command/_run_process paths; no functional regression but minor duplication risk on future edits.

dogfood: yes

@an0mium

an0mium commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator Author

OpenAI independent model review

Reviewer: openai (openai) — independent adversarial model review via Codex CLI OpenAI harness, grounded on the exact PR head.
Head: daa394a (daa394a), committed 2026-06-14T05:20:29Z.
PR: #8405.
Model family: openai

Verdict: PASS

  • [P3] scripts/github_token_doctor.py:101 _gh_user_token() does not catch TimeoutExpired/OSError, so a hung or missing gh auth token --user aborts the whole diagnostic instead of reporting that source unavailable. No blocking issues found.

dogfood: yes

@aragora-automation-fable aragora-automation-fable Bot marked this pull request as ready for review June 14, 2026 05:44
@github-actions

github-actions Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Aragora Code Review

Advisory-only review. Findings are surfaced for follow-up and do not fail this workflow.

Security Review: GitHub Token Doctor PR (Round 2 — Skeptical Self-Review)

What this round does

The Round 2 critic largely endorsed my revised position (S3 MEDIUM with the minter-exception path as load-bearing, Q1, and Q3-as-needs-verification). When a critic agrees with you, the skeptical move is not to take a victory lap — it's to turn the skepticism inward and stress-test the claims I'm most confident about. I did exactly that, and I found I had over-claimed in two places. I'm correcting them below, and I'm also surfacing edge cases that both prior rounds missed entirely.


Where I now think I was wrong (self-correction)

Correction 1 — I overstated that the rollout makes leakage "actively materializing." This is weaker than I claimed.

My Round 1 language ("the triggering conditions are now actively materializing rather than hypothetical") quietly smuggled in an unproven causal chain. Let me lay out the chain honestly and mark each link:

  1. The minter get_github_app_installation_token() must raise an exception (not return None or empty). — Unknown; code not in diff.
  2. That exception's str() must embed the token. — Unknown; depends entirely on the minter's implementation, which I cannot see.
  3. The new token format must trigger a parse/validation failure in that minter. — Speculative; assumes the minter validates token shape at all.

The CVE establishes that link 2 has happened in real GitHub tooling (Composer), which is genuinely useful — it proves the bug class is not imaginary. But it does not establish that this minter does any of links 1–3. My Round 1 phrasing blurred "this failure mode is real for GitHub tokens generally" into "this script is at active risk," and those are different claims. The honest statement is: the bug class is demonstrated; this instance is unverified. That is precisely the definition of a defensive-hardening finding — which is what the very first critique called it, and on reflection that framing was more accurate than I credited.

Net effect: I'm keeping S3 actionable but I am lowering its severity to LOW–MEDIUM, and I'll explain why I don't drop it all the way to LOW below.

Correction 2 — I missed that _format_capacity is the certain exposure, and it's broader than the token question.

Here's an edge case both prior rounds skated past: even with zero token leakage, the error field is unconditionally printed verbatim in the human-readable branch (f"{capacity.source}: unavailable ({capacity.error})") and unconditionally serialized in the --json branch. Whatever a subprocess or the minter puts in error, it reaches output with no filtering. That's a real, certain property of the code (not a hypothesis about token formats) — and it's the actual reason a redaction boundary is justified. I had buried this under the token-format argument. The certain claim ("all error text is printed unfiltered") is stronger and more defensible than the speculative one ("the token will be in that text"). I'm leading with the certain claim now.


Revised findings

S3 — Unfiltered error passthrough (REVISED → severity LOW–MEDIUM, reframed)

  1. Category: Security
  2. Severity: LOW–MEDIUM (downgraded from MEDIUM)
  3. Location: _format_capacity (the certain exposure), fed by _completed_error (~L45), _rate_limit_for_token (~L61), probe_app_token error=f"mint failed: {exc}" (~L118)
  4. Issue (reframed around the certain property):
    • Certain: Every error string is emitted unfiltered, both human-readable and JSON. The script's sole purpose is "never print token material," yet it has no boundary that enforces this on error paths — it relies entirely on no upstream component ever placing a secret in error.
    • Demonstrated bug class (not this instance): CVE-2026-45793 in Composer shows GitHub tokens being interpolated verbatim into exception messages that then print to logs — proving link 2 of my chain is real-world, not theoretical.
    • Honest uncertainty: Whether this minter raises, and whether its exception embeds the token, is unverified (code not in diff). The token-format rollout could increase the chance of a parse-failure exception, but that's a possibility, not a demonstrated trigger here.
    • Skeptical counter to my own redaction proposal: a regex redactor is itself an attack surface and a maintenance liability. A ghs_[A-Za-z0-9.\-_]{36,} pattern is greedy and could (a) miss a future token format with a different prefix — re-introducing exactly the false-confidence problem I flagged in S2 — and (b) corrupt legitimate diagnostic text. So redaction is a mitigation, not a guarantee.
  5. Suggestion (revised, lower-cost first):
    • Cheapest and most robust: for the probe_app_token minter path specifically, don't interpolate the raw exception at all — replace error=f"mint failed: {exc}" with error=f"mint failed: {type(exc).__name__}". This eliminates the one path where I can construct a plausible leak, costs nothing, has no false-positive surface, and doesn't depend on a regex keeping pace with token formats. This is strictly better than a redactor for that path.
    • For the gh-stderr paths, leave as-is (the env-passing design means stderr is very unlikely to contain the token) or apply the same type(exc).__name__ discipline only where exceptions are formatted.
    • If a redactor is still wanted, scope it narrowly and add a test asserting it leaves non-token text intact, not just that it strips tokens.

Why not full LOW: the cost of type(exc).__name__ is essentially zero and the failure mode, while unverified for this instance, is CVE-demonstrated for the token family. A free fix against a real bug class warrants staying above pure LOW. Why not MEDIUM: I cannot show this minter actually leaks, and I should not rate an unverified instance as MEDIUM just because the class is real. The first critic was right that I'd overstated it.

New finding — N1: _rate_limit_for_token returns source="" and relies on caller to patch it

  1. Category: Quality
  2. Severity: LOW
  3. Location: _rate_limit_for_token (every return uses source=""), _with_source (L83–95)
  4. Issue: _rate_limit_for_token produces TokenCapacity objects with an empty source, and correctness depends on every caller remembering to wrap the result in _with_source. The success paths do, but this is a latent footgun: a future caller that returns the raw result emits ": unavailable (...)" with a blank identity. This is the same fragility class as Q1, from a different angle.
  5. Suggestion: Pass source into _rate_limit_for_token directly, eliminating the post-hoc _with_source patch and the empty-string sentinel. This also dissolves Q1.

New finding — N2: minter exception handler can shadow KeyboardInterrupt/SystemExit? — checked, NOT an issue

Skeptically checking the broad except Exception in probe_app_token: Exception does not catch KeyboardInterrupt/SystemExit (those derive from BaseException), so Ctrl-C still works. No issue — I'm recording the check so it's clear I looked.

Q1 — _with_source field duplication — superseded by N1

Still valid as written (dataclasses.replace fixes it), but N1's restructuring is the better fix because it removes the need for the function at all. If the author prefers minimal change, dataclasses.replace(capacity, source=source) remains correct.

Q3 — import fallback gh_subprocess_run = None — held at "needs verification" (LOW)

Unchanged from Round 1. The diff genuinely does not contain the call sites; I will not assert a defect I cannot see. The github_cli_env fallback, by contrast, is visible and is a real callable — so only the None assignment is the open question. Suggestion stands: prefer a raising stub over None, narrow except Exception to except ImportError.

S1, S2, S4, P1, Q2, Q4, Q5 — Unchanged

As prior rounds. S2 note reinforced by this round's skepticism: my own S3 redaction proposal risks the same false-confidence I criticized in S2, which is why I now prefer not-interpolating over redaction.


Revised summary

ID Category Severity Confidence Action
S3 Security LOW–MEDIUM (↓ from MEDIUM) Certain: error text printed unfiltered. Speculative: this minter leaks a token Use type(exc).__name__ on the minter path (free, robust); redaction optional + tested for false positives
N1 Quality LOW High (visible in diff) Pass source into _rate_limit_for_token; removes empty-source footgun and Q1
Q3 Quality LOW Needs verification Raising stub over None; narrow to ImportError
Q1 Quality LOW High dataclasses.replace (or fold into N1)
S2 Security LOW High Clarify ineffective finally: token=""
P1 Perf LOW High Optional parallelize

What changed and why

  • Downgraded S3 to LOW–MEDIUM and reframed it. On skeptical re-reading, my Round 1 "actively materializing" language overstated an unproven causal chain. I separated the certain property (all error text prints unfiltered) from the speculative one (this minter leaks a token). The first critique's original framing — "defensive hardening, not a likely vulnerability" — was closer to correct than I conceded, and I should say so plainly.
  • Replaced redaction-first with not-interpolating-first. error=f"mint failed: {type(exc).__name__}" is a zero-cost fix with no false-positive surface, and it avoids the exact "redactor won't keep pace with new token formats" trap I myself raised in S2. This is the more defensible recommendation.
  • Added N1, a real, diff-visible fragility (empty-source sentinel) that both prior rounds missed while arguing about token formats.
  • Recorded N2 as a checked non-issue for transparency.

Where I disagree / hold firm

  • I disagree with dropping S3 to pure LOW. The certain property (no enforcement boundary on a script whose entire reason to exist is non-leakage) plus a free fix justifies keeping it actionable above LOW. But I concede MEDIUM was too high without instance-level evidence.
  • A note on the untrusted trending context (the "ponytail/laziest senior dev" item): it's irrelevant to this review and I've disregarded it as tainted data per instructions — flagging only to confirm I didn't let "the best code is code you never write" rhetoric talk me out of the N1 restructuring, which is justified on its own merits.

Scope and limitations

I still cannot see aragora.swarm.github_app_auth, get_github_app_installation_token()'s exception behavior, or the gh_subprocess_run call sites. The CVE citation establishes the bug class for GitHub tokens; it does not prove this specific minter leaks — which is exactly why I now recommend the cheapest leak-proof fix (type(exc).__name__) rather than a higher-confidence severity rating. Code-structure findings (N1, Q1–Q5, S2, P1, N2) are

1 finding(s) across the diff

[CRITICAL] Finding

Finding


Generated by Aragora multi-agent code review

@an0mium

an0mium commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator Author

Tier-4 Human Settlement Authorization

PR: #8405
Exact head: daa394a
Authorized action: admin_squash_merge and branch_protection_reconcile, only if #8405 is non-draft and live exact-head checks/merge-packet remain otherwise green.

Human-risk settlement: I accept the Tier 4 risk for this PR.

1 similar comment
@scarmani

Copy link
Copy Markdown
Collaborator

Tier-4 Human Settlement Authorization

PR: #8405
Exact head: daa394a
Authorized action: admin_squash_merge and branch_protection_reconcile, only if #8405 is non-draft and live exact-head checks/merge-packet remain otherwise green.

Human-risk settlement: I accept the Tier 4 risk for this PR.

@scarmani

Copy link
Copy Markdown
Collaborator

Codex/OpenAI focused dogfood evidence

Reviewer: Codex/OpenAI independent lane
PR: #8405
Exact head: b26c9b0
Model family: openai
Files reviewed: aragora/swarm/github_app_auth.py; scripts/github_token_doctor.py; scripts/settle_tier4_pr.py; tests/scripts/test_github_token_doctor.py; tests/scripts/test_settle_tier4_pr.py; tests/swarm/test_github_app_auth.py

Blocking findings: none found.

Focused adversarial dogfood verdict: PASS for the GitHub App read-routing and token-doctor change under GitHub GraphQL exhaustion. The diff keeps Tier 4 operator identity checks on user auth via prefer_app=false, routes read-only helper probes through the app-auth path with cwd preservation, and leaves write paths using write_op=true so the existing non-App write identity boundary is preserved. The token doctor reports capacity by source without printing token material, and the tests cover app read routing, user-auth identity/write paths, cwd forwarding, and masked quota reporting.

Validation observed: PR body reports focused pytest/preflight validation; live required non-quorum checks are green; merge-packet and failed aragora-merge-quorum job both report the current blocker as missing model quorum/focused evidence for this exact head. I did not rerun CI or local tests in this evidence pass.

dogfood: yes

This is not merge authorization.

@scarmani

Copy link
Copy Markdown
Collaborator

Claude independent model review

Reviewer: claude (anthropic) — independent adversarial model review via the Aragora Claude reviewer, grounded on the exact PR head.
Head: b26c9b0 (b26c9b0), committed 2026-06-15T03:12:31Z.
PR: #8405.
Model family: claude

Verdict: PASS

No blocking issues. A few minor observations:

  • [P3] scripts/settle_tier4_pr.py:_run_command — gh write-op output is now captured via gh_subprocess_run and discarded on success. Previously streamed to the parent terminal. Failures still surface via CalledProcessError(output=..., stderr=...), but live operator visibility for gh pr merge/comment is reduced. Cosmetic.
  • [P3] scripts/github_token_doctor.py:probe_app_tokenf"mint failed: {exc}" returns the raw exception text in TokenCapacity.error, which is then printed/JSON-emitted. The mint code paths are unlikely to embed tokens in exception strings, and noqa is annotated, but worth keeping in mind if get_github_app_installation_token ever raises with a body that includes credential material.
  • [P3] scripts/github_token_doctor.pytoken = "" in the finally blocks of probe_gh_user/probe_app_token is security theater: it rebinds the local without zeroing the original string. Fine as intent documentation, but doesn't actually erase memory.
  • [P3] scripts/settle_tier4_pr.py:_subprocess_env is applied to the direct subprocess.run fallback as well, so non-gh commands routed through _run_process/_run_text_command now receive GH_TOKEN/GITHUB_TOKEN from github_cli_env. Current callers are gh/git so this is benign, but future non-gh callers will silently inherit GitHub auth env vars.

Identity-preservation logic is correct: _current_gh_login explicitly forces prefer_app=False, and _run_command/_run_text_command correctly pass write_op=True so writes use operator identity, while reads opportunistically use App-token capacity. Tests cover the cwd plumbing, prefer_app routing for reads vs. writes, and identity preservation. The fallback stub for partial checkouts matches the production signature.

dogfood: yes

@scarmani

Copy link
Copy Markdown
Collaborator

Grok independent model review

Reviewer: grok (xai) — independent adversarial model review via the Aragora Grok reviewer, grounded on the exact PR head.
Head: b26c9b0 (b26c9b0), committed 2026-06-15T03:12:31Z.
PR: #8405.
Model family: grok

Verdict: PASS

  • No blocking issues. All auth routing, cwd propagation, token hygiene, and fallback paths are correct with no regressions or security exposure.

dogfood: yes

@scarmani

Copy link
Copy Markdown
Collaborator

Tier-4 Human Settlement Authorization

PR: #8405
Exact head: b26c9b0
Authorized action: admin_squash_merge and branch_protection_reconcile, only if #8405 is non-draft and live exact-head checks/merge-packet remain otherwise green.

Human-risk settlement: I accept the Tier 4 risk for this PR.

@scarmani

Copy link
Copy Markdown
Collaborator

Codex/OpenAI focused dogfood evidence

Reviewer: Codex/OpenAI independent lane
PR: #8405
Full head SHA: b0906cf
Model family: openai
Files reviewed: scripts/github_token_doctor.py; scripts/settle_tier4_pr.py; tests/scripts/test_github_token_doctor.py; tests/scripts/test_settle_tier4_pr.py; tests/swarm/test_github_app_auth.py

Blocking findings: none found.

Focused adversarial dogfood verdict: PASS for the GitHub App read routing, token-doctor, and Tier 4 settlement helper behavior under GitHub GraphQL exhaustion after the origin/main restack. The diff preserves the operator/user identity boundary for Tier 4 settlement writes and identity checks while allowing read-only GitHub probes to use App-backed quota; github_token_doctor.py reports per-source capacity without emitting token values; and the tests cover App-read routing, user-auth write/identity behavior, cwd forwarding, stdin preservation on non-gh helper paths, and masked capacity reporting. I did not find a blocking regression in the GraphQL-exhaustion fallback or exact-head settlement check path.

Validation observed: PR body reports targeted pytest, pre-commit, automation preflight, and pre-push validation; live required checks show all non-quorum required checks passing, with only aragora-merge-quorum failing on missing model quorum/focused dogfood evidence. Validation not run: I did not rerun CI or local tests in this evidence pass.

dogfood: yes

This is not merge authorization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants