feat(quorum): surface reviewer findings + best-of-N verdict sampling #8495
Draft
scarmani wants to merge 5 commits into
Get more value from thorough-but-flaky critics (esp. grok), whose findings were
previously invisible unless you read raw output:
- extract_findings(): markdown-tolerant [P0]-[P3] parser ("- **[P2]**", "1. [p1]",
"[P3]:" all parse).
- _render_outcome now shows a per-reviewer findings digest ("2×P2, 1×P3") AND a
dedicated "Advisory findings (non-counting reviewers)" section, so a critic's
catches surface every run even when its body doesn't count toward quorum.
- best-of-N verdict sampling (_run_reviewer_best_of_n, env ARAGORA_REVIEWER_BEST_OF_N,
default 1 = unchanged): samples a reviewer N times and takes the majority verdict
to dampen run-to-run flapping; ties resolve to changes_requested (never invent
support). Wired at the collect concurrency chokepoint.
Tests: findings extraction/digest, advisory surfacing, best-of-N majority/tie/single.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
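The majority rule with a fail-safe tie, as described in the commit message above, can be sketched roughly as follows. This is an illustration only, not the PR's _run_reviewer_best_of_n: the helper names majority_verdict and best_of_n_from_env and the "approve" label are hypothetical, and only changes_requested and ARAGORA_REVIEWER_BEST_OF_N come from the description. It assumes plurality counting and that an unparsable env value falls back to 1.

```python
from collections import Counter

# "approve" is a hypothetical label; the PR text names only changes_requested.
APPROVE = "approve"
CHANGES_REQUESTED = "changes_requested"


def majority_verdict(verdicts):
    """Return the most common verdict; a tie for first place fails safe."""
    if not verdicts:
        # No samples at all: never invent support.
        return CHANGES_REQUESTED
    ranked = Counter(verdicts).most_common()
    top_verdict, top_count = ranked[0]
    # If the runner-up has the same count, there is no majority to report.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return CHANGES_REQUESTED
    return top_verdict


def best_of_n_from_env(environ):
    """Read the sample count; default 1 keeps single-sample behavior."""
    raw = environ.get("ARAGORA_REVIEWER_BEST_OF_N", "1")
    try:
        n = int(raw)
    except ValueError:
        return 1  # assumption: bad values fall back to the default
    return max(1, n)


if __name__ == "__main__":
    print(majority_verdict([APPROVE, APPROVE, CHANGES_REQUESTED]))
    print(majority_verdict([APPROVE, CHANGES_REQUESTED]))
```

With two approvals and one dissent the sketch returns the approval, which is exactly the outvoting behavior a reviewer of this design would want to scrutinize.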
4b5fed4 to 5d7a0a5
…ew finding)
deepseek review: a leading label like '**Note**: [P1] ...' was dropped because the priority regex was anchored at line-start. Use .search (the bracketed [P0-3] token is specific enough to avoid prose false positives) so labeled findings still surface. Adds a Note:-prefixed test case.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ings)
deepseek+grok review of #8495 found real issues (all fixed):
- extract_findings: .search regressed to match code subscripts (arr[p2]); add negative-lookbehind so only standalone [P0-3] tags match (Note: [P1] still does).
- _run_reviewer_best_of_n: degenerate all-'unknown' case fell through to a sample whose body didn't carry the fail-safe CHANGES-REQUESTED verdict (twice-flagged by deepseek AND grok); now synthesizes the winning verdict onto the returned body.
- best-of-N now clamped to _MAX_REVIEWER_BEST_OF_N=5 (was unbounded → cost blowup).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
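The lookbehind fix for findings extraction can be illustrated with a small sketch. It is not the PR's extract_findings: the function name extract_findings_sketch, the exact pattern, and the (priority, line) return shape are assumptions chosen to reproduce the cases named in the commit messages (list bullets, a Note: label, and the arr[p2] false positive).

```python
import re

# Assumed pattern: a bracketed P0-P3 tag that is not glued to an identifier
# character or a closing bracket, so subscripts such as arr[p2] are skipped.
_PRIORITY_TAG = re.compile(r"(?<![\w\]\)])\[[Pp]([0-3])\]")


def extract_findings_sketch(text):
    """Return (priority, line) pairs for lines carrying a standalone tag."""
    findings = []
    for line in text.splitlines():
        # search, not match: a leading label like "**Note**:" must not hide the tag.
        match = _PRIORITY_TAG.search(line)
        if match:
            findings.append((f"P{match.group(1)}", line.strip()))
    return findings


if __name__ == "__main__":
    sample = "\n".join([
        "- **[P2]** missing cap",
        "1. [p1] race in collect",
        "[P3]: nit",
        "**Note**: [P1] labeled finding",
        "value = arr[p2]",
    ])
    for priority, line in extract_findings_sketch(sample):
        print(priority, "|", line)
```

The lookbehind is a single-character class, which keeps it fixed-width as Python's re module requires.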
…ecision)
grok's review of the gate change correctly flagged that best-of-N majority voting weakens the adversarial gate (2 PASS can outvote a genuine dissent, and the all-unknown synthesize-verdict fix violated the module's 'never fabricate' invariant). Per operator decision, remove best-of-N entirely; keep only the uncontroversial findings-surfacing (extract_findings + always-on advisory comments) which delivers the 'get value from grok' goal without auto-outvoting dissent. Reviewer-run loop reverts to _run_reviewer_with_infra_retry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
deepseek [P1]: the advisory section appended unconstrained reviewer bodies (100KB+ possible) with no cap → OOM risk in downstream report/log rendering. Apply the existing _cap_text bound, matching every other reviewer-text path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
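A length cap of the kind this commit applies might look like the sketch below. The real _cap_text signature and limit are not shown in this PR, so cap_text_sketch, its default limit, and the truncation marker are all hypothetical.

```python
def cap_text_sketch(text, limit=4000, marker="\n[truncated]"):
    """Bound text to at most `limit` characters, marking any truncation."""
    if len(text) <= limit:
        return text
    # Reserve room for the marker so the result never exceeds the limit.
    keep = max(0, limit - len(marker))
    return (text[:keep] + marker)[:limit]


if __name__ == "__main__":
    body = "x" * 100_000  # stand-in for an oversized reviewer body
    capped = cap_text_sketch(body, limit=100)
    print(len(capped))
```

Applying the bound at the point where reviewer text enters the rendered report keeps every downstream consumer within the same limit.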
This was referenced Jun 17, 2026
Closing this draft as part of the queue-drain substrate freeze path. Reason: this PR changes quorum / reviewer-surfacing behavior, which is substrate merge-quorum tooling, and it is explicitly stacked on another branch rather than independently mergeable. The current drain mission avoids substrate meta-work. Current live checks show no active non-Codex lane owner, no unread steering, no reviews, and no local worktree to preserve. Keeping it open adds queue pressure without an autonomous drain path in this loop. No branch deletion was requested; branch Head closed: |
Stacked on #8494 (base = fix/mergegate-reviewer-robustness). Rebase to main once #8494 lands.
Why
Thorough reviewers like grok produce genuinely valuable findings (it caught the Tier-4 auto-settle hole in #8485 and two real bugs in #8494) — but those findings were invisible unless you manually read raw output, and grok's counting verdict flaps run-to-run. This makes grok a better critic than voter; this PR captures its critic-value and dampens its voter-variance.
What
- extract_findings(): markdown-tolerant [P0]-[P3] parser ("- **[P2]**", "1. [p1]", "[P3]:" all parse).
- _render_outcome now shows a per-reviewer digest ("2×P2, 1×P3") and a dedicated "Advisory findings (non-counting reviewers)" section, so a critic's catches appear every run even when its body doesn't count toward quorum.
- _run_reviewer_best_of_n (env ARAGORA_REVIEWER_BEST_OF_N, default 1 = unchanged) samples a reviewer N times and takes the majority verdict; ties resolve to changes_requested (never invent support). Wired at the collect concurrency chokepoint.
Tests
findings extraction/digest, markdown tolerance, advisory surfacing, best-of-N majority/tie/single-sample. Full tests/swarm/test_quorum_evidence.py green (152 passed).
Notes
🤖 Generated with Claude Code
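For reference, the per-reviewer digest format quoted in the description ("2×P2, 1×P3") could be produced along these lines. The helper name findings_digest and the choice to list the most severe priority first are assumptions, not the behavior of _render_outcome.

```python
from collections import Counter


def findings_digest(priorities):
    """Summarize priority tags as count×tag pairs, most severe (P0) first."""
    counts = Counter(p.upper() for p in priorities)
    # Sorting the tag names puts P0 before P1, P2, P3.
    return ", ".join(f"{counts[tag]}×{tag}" for tag in sorted(counts))


if __name__ == "__main__":
    print(findings_digest(["P2", "p3", "P2"]))
```

An empty input yields an empty string, so a caller can omit the digest for reviewers with no tagged findings.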