feat(quorum): surface reviewer findings + best-of-N verdict sampling by scarmani · Pull Request #8495 · synaptent/aragora

scarmani · 2026-06-17T04:32:54Z

Stacked on #8494 (base = fix/mergegate-reviewer-robustness). Rebase to main once #8494 lands.

Why

Thorough reviewers like grok produce genuinely valuable findings (it caught the Tier-4 auto-settle hole in #8485 and two real bugs in #8494) — but those findings were invisible unless you manually read raw output, and grok's counting verdict flaps run-to-run. This makes grok a better critic than voter; this PR captures its critic-value and dampens its voter-variance.

What

extract_findings() — markdown-tolerant [P0]-[P3] parser (- **[P2]**, 1. [p1], [P3]: all parse).
Findings surfaced in output — _render_outcome now shows a per-reviewer digest (2×P2, 1×P3) and a dedicated "Advisory findings (non-counting reviewers)" section, so a critic's catches appear every run even when its body doesn't count toward quorum.
best-of-N verdict sampling — _run_reviewer_best_of_n (env ARAGORA_REVIEWER_BEST_OF_N, default 1 = unchanged) samples a reviewer N times and takes the majority verdict; ties resolve to changes_requested (never invent support). Wired at the collect concurrency chokepoint.

Tests

findings extraction/digest, markdown tolerance, advisory surfacing, best-of-N majority/tie/single-sample. Full tests/swarm/test_quorum_evidence.py green (152 passed).

Notes

Counting/settlement semantics unchanged at default (best-of-N=1; advisory is display-only, not posted to GitHub yet — posting an advisory PR comment is a deliberate follow-up to keep this PR read-only-safe).

🤖 Generated with Claude Code

Get more value from thorough-but-flaky critics (esp. grok), whose findings were previously invisible unless you read raw output: - extract_findings(): markdown-tolerant [P0]-[P3] parser ("- **[P2]**", "1. [p1]", "[P3]:" all parse). - _render_outcome now shows a per-reviewer findings digest ("2×P2, 1×P3") AND a dedicated "Advisory findings (non-counting reviewers)" section, so a critic's catches surface every run even when its body doesn't count toward quorum. - best-of-N verdict sampling (_run_reviewer_best_of_n, env ARAGORA_REVIEWER_BEST_OF_N, default 1 = unchanged): samples a reviewer N times and takes the majority verdict to dampen run-to-run flapping; ties resolve to changes_requested (never invent support). Wired at the collect concurrency chokepoint. Tests: findings extraction/digest, advisory surfacing, best-of-N majority/tie/single. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ew finding) deepseek review: a leading label like '**Note**: [P1] ...' was dropped because the priority regex was anchored at line-start. Use .search (the bracketed [P0-3] token is specific enough to avoid prose false positives) so labeled findings still surface. Adds a Note:-prefixed test case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ings) deepseek+grok review of #8495 found real issues (all fixed): - extract_findings: .search regressed to match code subscripts (arr[p2]); add negative-lookbehind so only standalone [P0-3] tags match (Note: [P1] still does). - _run_reviewer_best_of_n: degenerate all-'unknown' case fell through to a sample whose body didn't carry the fail-safe CHANGES-REQUESTED verdict (twice-flagged by deepseek AND grok) — now synthesizes the winning verdict onto the returned body. - best-of-N now clamped to _MAX_REVIEWER_BEST_OF_N=5 (was unbounded → cost blowup). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ecision) grok's review of the gate change correctly flagged that best-of-N majority voting weakens the adversarial gate (2 PASS can outvote a genuine dissent, and the all-unknown synthesize-verdict fix violated the module's 'never fabricate' invariant). Per operator decision, remove best-of-N entirely; keep only the uncontroversial findings-surfacing (extract_findings + always-on advisory comments) which delivers the 'get value from grok' goal without auto-outvoting dissent. Reviewer-run loop reverts to _run_reviewer_with_infra_retry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

deepseek [P1]: the advisory section appended unconstrained reviewer bodies (100KB+ possible) with no cap → OOM risk in downstream report/log rendering. Apply the existing _cap_text bound, matching every other reviewer-text path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scarmani · 2026-06-30T07:15:35Z

Closing this draft as part of the queue-drain substrate freeze path.

Reason: this PR changes quorum / reviewer-surfacing behavior, which is substrate merge-quorum tooling, and it is explicitly stacked on another branch rather than independently mergeable. The current drain mission avoids substrate meta-work. Current live checks show no active non-Codex lane owner, no unread steering, no reviews, and no local worktree to preserve. Keeping it open adds queue pressure without an autonomous drain path in this loop.

No branch deletion was requested; branch feat/reviewer-critic-surfacing is preserved for recovery if the operator wants to revive this tooling change outside the drain loop.

Head closed: 388f3a347b5d725f6bb542d921ad1a1369708425.

Base automatically changed from fix/mergegate-reviewer-robustness to main June 17, 2026 13:05

scarmani force-pushed the feat/reviewer-critic-surfacing branch from 4b5fed4 to 5d7a0a5 Compare June 17, 2026 13:10

scarmani and others added 4 commits June 17, 2026 09:24

This was referenced Jun 17, 2026

[automation] Stage-Gate Conductor Log #7162

Open

[stage-gate] PR drain at 51 open — 6.4× the ≤8 healthy bound #7467

Open

[automation] Founder Brief #6106

Open

scarmani closed this Jun 30, 2026

scarmani reopened this Jun 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(quorum): surface reviewer findings + best-of-N verdict sampling#8495

feat(quorum): surface reviewer findings + best-of-N verdict sampling#8495
scarmani wants to merge 5 commits into
mainfrom
feat/reviewer-critic-surfacing

scarmani commented Jun 17, 2026

Uh oh!

scarmani commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

scarmani commented Jun 17, 2026

Why

What

Tests

Notes

Uh oh!

scarmani commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant