Skip to content

feat(quorum): surface reviewer findings + best-of-N verdict sampling#8495

Draft
scarmani wants to merge 5 commits into
mainfrom
feat/reviewer-critic-surfacing
Draft

feat(quorum): surface reviewer findings + best-of-N verdict sampling#8495
scarmani wants to merge 5 commits into
mainfrom
feat/reviewer-critic-surfacing

Conversation

@scarmani

Copy link
Copy Markdown
Collaborator

Stacked on #8494 (base = fix/mergegate-reviewer-robustness). Rebase to main once #8494 lands.

Why

Thorough reviewers like grok produce genuinely valuable findings (it caught the Tier-4 auto-settle hole in #8485 and two real bugs in #8494) — but those findings were invisible unless you manually read raw output, and grok's counting verdict flaps run-to-run. This makes grok a better critic than voter; this PR captures its critic-value and dampens its voter-variance.

What

  • extract_findings() — markdown-tolerant [P0]-[P3] parser (- **[P2]**, 1. [p1], [P3]: all parse).
  • Findings surfaced in output_render_outcome now shows a per-reviewer digest (2×P2, 1×P3) and a dedicated "Advisory findings (non-counting reviewers)" section, so a critic's catches appear every run even when its body doesn't count toward quorum.
  • best-of-N verdict sampling_run_reviewer_best_of_n (env ARAGORA_REVIEWER_BEST_OF_N, default 1 = unchanged) samples a reviewer N times and takes the majority verdict; ties resolve to changes_requested (never invent support). Wired at the collect concurrency chokepoint.

Tests

findings extraction/digest, markdown tolerance, advisory surfacing, best-of-N majority/tie/single-sample. Full tests/swarm/test_quorum_evidence.py green (152 passed).

Notes

  • Counting/settlement semantics unchanged at default (best-of-N=1; advisory is display-only, not posted to GitHub yet — posting an advisory PR comment is a deliberate follow-up to keep this PR read-only-safe).

🤖 Generated with Claude Code

Base automatically changed from fix/mergegate-reviewer-robustness to main June 17, 2026 13:05
Get more value from thorough-but-flaky critics (esp. grok), whose findings were
previously invisible unless you read raw output:

- extract_findings(): markdown-tolerant [P0]-[P3] parser ("- **[P2]**", "1. [p1]",
  "[P3]:" all parse).
- _render_outcome now shows a per-reviewer findings digest ("2×P2, 1×P3") AND a
  dedicated "Advisory findings (non-counting reviewers)" section, so a critic's
  catches surface every run even when its body doesn't count toward quorum.
- best-of-N verdict sampling (_run_reviewer_best_of_n, env ARAGORA_REVIEWER_BEST_OF_N,
  default 1 = unchanged): samples a reviewer N times and takes the majority verdict
  to dampen run-to-run flapping; ties resolve to changes_requested (never invent
  support). Wired at the collect concurrency chokepoint.

Tests: findings extraction/digest, advisory surfacing, best-of-N majority/tie/single.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@scarmani scarmani force-pushed the feat/reviewer-critic-surfacing branch from 4b5fed4 to 5d7a0a5 Compare June 17, 2026 13:10
scarmani and others added 4 commits June 17, 2026 09:24
…ew finding)

deepseek review: a leading label like '**Note**: [P1] ...' was dropped because
the priority regex was anchored at line-start. Use .search (the bracketed
[P0-3] token is specific enough to avoid prose false positives) so labeled
findings still surface. Adds a Note:-prefixed test case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ings)

deepseek+grok review of #8495 found real issues (all fixed):
- extract_findings: .search regressed to match code subscripts (arr[p2]); add
  negative-lookbehind so only standalone [P0-3] tags match (Note: [P1] still does).
- _run_reviewer_best_of_n: degenerate all-'unknown' case fell through to a sample
  whose body didn't carry the fail-safe CHANGES-REQUESTED verdict (twice-flagged by
  deepseek AND grok) — now synthesizes the winning verdict onto the returned body.
- best-of-N now clamped to _MAX_REVIEWER_BEST_OF_N=5 (was unbounded → cost blowup).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ecision)

grok's review of the gate change correctly flagged that best-of-N majority voting
weakens the adversarial gate (2 PASS can outvote a genuine dissent, and the
all-unknown synthesize-verdict fix violated the module's 'never fabricate'
invariant). Per operator decision, remove best-of-N entirely; keep only the
uncontroversial findings-surfacing (extract_findings + always-on advisory
comments) which delivers the 'get value from grok' goal without auto-outvoting
dissent. Reviewer-run loop reverts to _run_reviewer_with_infra_retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
deepseek [P1]: the advisory section appended unconstrained reviewer bodies
(100KB+ possible) with no cap → OOM risk in downstream report/log rendering.
Apply the existing _cap_text bound, matching every other reviewer-text path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@scarmani

Copy link
Copy Markdown
Collaborator Author

Closing this draft as part of the queue-drain substrate freeze path.

Reason: this PR changes quorum / reviewer-surfacing behavior, which is substrate merge-quorum tooling, and it is explicitly stacked on another branch rather than independently mergeable. The current drain mission avoids substrate meta-work. Current live checks show no active non-Codex lane owner, no unread steering, no reviews, and no local worktree to preserve. Keeping it open adds queue pressure without an autonomous drain path in this loop.

No branch deletion was requested; branch feat/reviewer-critic-surfacing is preserved for recovery if the operator wants to revive this tooling change outside the drain loop.

Head closed: 388f3a347b5d725f6bb542d921ad1a1369708425.

@scarmani scarmani closed this Jun 30, 2026
@scarmani scarmani reopened this Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant