Skip to content

fix(review): put contract-reliable reasoning models first, flagships last#318

Merged
seonghobae merged 1 commit into
mainfrom
fix/review-order-reasoning-first
Jul 5, 2026
Merged

fix(review): put contract-reliable reasoning models first, flagships last#318
seonghobae merged 1 commit into
mainfrom
fix/review-order-reasoning-first

Conversation

@seonghobae

Copy link
Copy Markdown
Contributor

배경

쿼터순(deepseek-v3 우선) 배치가 표면화한 문제: 비추론 고쿼터 모델이 리뷰 계약을 못 지킴 — 라벨 없는 bare APPROVE(#317로 repair), 이어서 환각성(비-source-backed) REQUEST_CHANGES finding → Approve PR 게이트가 거부 → 리뷰 실패(순환 안 함).

수정

계약을 안정적으로 지키는 mini 추론 모델(o4-mini·o3-mini·gpt-5-mini/nano/chat)을 1순위로. deepseek/mistral/llama는 폴백, 최소쿼터 flagship(o3·gpt-5)은 최후미(gpt-5 1순위는 6시간 정체 유발). #295 이전의 검증된 리드 모델(o4-mini)과 일치.

설정(ATTEMPTS/RUN_TIMEOUT/step timeout)은 미변경. 가드 테스트 갱신.

🤖 Generated with Claude Code

…last

Ordering the highest-quota non-reasoning models first (deepseek-v3, mistral,
llama-4) surfaced that they emit reviews the publish/approve gates reject — bare
APPROVE summaries missing required labels, then hallucinated non-source-backed
REQUEST_CHANGES findings — and the pool records them as success instead of
falling through. The mini reasoning models (o4-mini, o3-mini, gpt-5-mini/nano/
chat) reliably emit the strict contract and have adequate quota, so lead with
them; keep deepseek/mistral/llama as fallbacks and the rate-starved flagships
o3/gpt-5 last (first-placing gpt-5 stalled every review to timeout). Matches the
pre-#295 known-good lead model (o4-mini).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RTAMs4bpSZS77Xe3RQjv9P
@seonghobae seonghobae merged commit 3180040 into main Jul 5, 2026
11 of 12 checks passed

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

OpenCode cannot approve yet because required coverage evidence did not pass.

Review outcome

1. HIGH .github/workflows/opencode-review.yml:1 - Coverage evidence did not prove required test/docstring evidence

  • Problem: The required coverage-evidence job result was failure, so OpenCode cannot establish approval sufficiency for this head.

  • Root cause: Automated approval is only valid when the same-head coverage-evidence job proves supported repository test suites passed and configured docstring gates passed or were advisory, or reports not applicable because no supported source files or package manifests exist. Missing, failed, skipped, unavailable, or unsupported-tooling test evidence is a blocker.

  • Fix: Install or configure the repository test/docstring evidence tooling when source files or package manifests exist, rerun the current-head coverage-evidence job, and approve only after it reports success with required evidence or explicit no-source not-applicable evidence.

  • Regression test: Keep the approval branch checking needs.coverage-evidence.result == success before posting APPROVE, and publish REQUEST_CHANGES when coverage-evidence blocker states such as cancelled, skipped, failed, unsupported-tooling, or below-100 evidence are present.

  • Result: REQUEST_CHANGES

  • Reason: coverage-evidence result was failure, so required test/docstring evidence was not proven for current head 337a8bfc5039402da4d73687065d18f7ebec9beb.

  • Head SHA: 337a8bfc5039402da4d73687065d18f7ebec9beb

  • Workflow run: 28744364199

  • Workflow attempt: 1

Coverage evidence

Coverage Evidence

  • Head SHA: 337a8bfc5039402da4d73687065d18f7ebec9beb
  • Required test evidence: supported repository test suites must pass.
  • Required docstring evidence: repository-owned docstring gates must pass when configured; otherwise docstring coverage is advisory.

Python project dependencies (.)

Using CPython 3.12.3 interpreter at: /usr/bin/python3
Creating virtual environment at: .venv
Resolved 17 packages in 117ms
Downloading pygments (1.2MiB)
 Downloaded pygments
Prepared 13 packages in 89ms
Installed 13 packages in 11ms
 + attrs==26.1.0
 + click==8.4.2
 + colorama==0.4.6
 + coverage==7.15.0
 + iniconfig==2.3.0
 + interrogate==1.7.0
 + packaging==26.2
 + pluggy==1.6.0
 + py==1.11.0
 + pygments==2.20.0
 + pytest==9.1.1
 + pytest-cov==7.1.0
 + tabulate==0.10.0
  • Result: PASS

Python coverage with missing-line report (.)

============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-9.1.1, pluggy-1.6.0
rootdir: /home/runner/work/.github/.github/pr-head
configfile: pyproject.toml
plugins: cov-7.1.0
collected 166 items

tests/test_assert_opencode_reasoning_effort.py ........                  [  4%]
tests/test_codeql_pr_workflow_contract.py .                              [  5%]
tests/test_noema_review_gate.py .......F...                              [ 12%]
tests/test_opencode_agent_contract.py .............                      [ 19%]
tests/test_opencode_review_normalize_output.py ......................... [ 34%]
                                                                         [ 34%]
tests/test_opencode_workflow_shell_syntax.py .                           [ 35%]
tests/test_pr_governance_audit_contract.py ...                           [ 37%]
tests/test_pr_review_fix_scheduler.py ...................                [ 48%]
tests/test_pr_review_fix_scheduler_coverage.py ..                        [ 50%]
tests/test_pr_review_merge_scheduler.py ................................ [ 69%]
..............................                                           [ 87%]
tests/test_render_opencode_prompt_template.py ....                       [ 89%]
tests/test_review_execution_contracts.py ..                              [ 90%]
tests/test_sandboxed_verify.py .........                                 [ 96%]
tests/test_sandboxed_web_e2e.py ......                                   [100%]

=================================== FAILURES ===================================
_______________ test_call_llm_handles_configuration_and_verdicts _______________

monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fcf837b0c80>

    def test_call_llm_handles_configuration_and_verdicts(monkeypatch):
        pr = make_pr()
        monkeypatch.delenv("NOEMA_LLM_API_URL", raising=False)
        monkeypatch.delenv("NOEMA_LLM_API_KEY", raising=False)
        assert noema.call_llm("owner/repo", 1, pr, "diff", False) is None
    
        monkeypatch.setenv("NOEMA_LLM_API_URL", "file:///etc/passwd")
        monkeypatch.setenv("NOEMA_LLM_API_KEY", "secret")
>       with pytest.raises(ValueError, match="must start with http:// or https://"):
E       AssertionError: Regex pattern did not match.
E         Expected regex: 'must start with http:// or https://'
E         Actual message: 'URL scheme must be http or https'

tests/test_noema_review_gate.py:209: AssertionError
----------------------------- Captured stdout call -----------------------------
Noema LLM review unavailable: NOEMA_LLM_API_URL or NOEMA_LLM_API_KEY is not configured.
=============================== warnings summary ===============================
tests/test_assert_opencode_reasoning_effort.py::test_module_entrypoint_success
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.assert_opencode_reasoning_effort' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.assert_opencode_reasoning_effort'; this may result in unpredictable behaviour

tests/test_render_opencode_prompt_template.py::test_module_entrypoint
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.render_opencode_prompt_template' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.render_opencode_prompt_template'; this may result in unpredictable behaviour

tests/test_review_execution_contracts.py::test_discovers_package_managers_java_r_json_and_main
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.review_execution_contracts' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.review_execution_contracts'; this may result in unpredictable behaviour

tests/test_sandboxed_verify.py::test_module_main_entrypoint
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.sandboxed_verify' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.sandboxed_verify'; this may result in unpredictable behaviour

tests/test_sandboxed_web_e2e.py::test_module_import_and_main_entrypoint
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.sandboxed_web_e2e' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.sandboxed_web_e2e'; this may result in unpredictable behaviour

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/test_noema_review_gate.py::test_call_llm_handles_configuration_and_verdicts - AssertionError: Regex pattern did not match.
  Expected regex: 'must start with http:// or https://'
  Actual message: 'URL scheme must be http or https'
================== 1 failed, 165 passed, 5 warnings in 5.71s ===================
  • Result: FAIL (exit 1)

Python docstring coverage advisory

RESULT: PASSED (minimum: 100.0%, actual: 100.0%)
  • Result: PASS

Coverage Decision

  • Result: FAIL
  • Test evidence: not proven passing
  • Docstring evidence: not proven passing when configured
  • Failure count: 1

Changed-File Evidence Map

flowchart LR
  PR["PR changed files"] --> Evidence["OpenCode bounded evidence"]
  Evidence --> S1["Workflow: opencode-review.yml"]
  S1 --> I1["GitHub Actions review job"]
  I1 --> R1["Review risk: Workflow: opencode-review.yml"]
  R1 --> V1["actionlint plus required checks"]
  Evidence --> S2["Test: test_opencode_agent_contract.py"]
  S2 --> I2["regression suite"]
  I2 --> R2["Review risk: Test: test_opencode_agent_contract.py"]
  R2 --> V2["targeted test run"]
Loading

@github-actions

github-actions Bot commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

OpenCode Review Overview

  • Head SHA: 337a8bfc5039402da4d73687065d18f7ebec9beb
  • Workflow run: 28744364199
  • Workflow attempt: 1
  • Gate result: REQUEST_CHANGES (approval step)

Pull request overview

OpenCode cannot approve yet because required coverage evidence did not pass.

Review outcome

1. HIGH .github/workflows/opencode-review.yml:1 - Coverage evidence did not prove required test/docstring evidence

  • Problem: The required coverage-evidence job result was failure, so OpenCode cannot establish approval sufficiency for this head.

  • Root cause: Automated approval is only valid when the same-head coverage-evidence job proves supported repository test suites passed and configured docstring gates passed or were advisory, or reports not applicable because no supported source files or package manifests exist. Missing, failed, skipped, unavailable, or unsupported-tooling test evidence is a blocker.

  • Fix: Install or configure the repository test/docstring evidence tooling when source files or package manifests exist, rerun the current-head coverage-evidence job, and approve only after it reports success with required evidence or explicit no-source not-applicable evidence.

  • Regression test: Keep the approval branch checking needs.coverage-evidence.result == success before posting APPROVE, and publish REQUEST_CHANGES when coverage-evidence blocker states such as cancelled, skipped, failed, unsupported-tooling, or below-100 evidence are present.

  • Result: REQUEST_CHANGES

  • Reason: coverage-evidence result was failure, so required test/docstring evidence was not proven for current head 337a8bfc5039402da4d73687065d18f7ebec9beb.

  • Head SHA: 337a8bfc5039402da4d73687065d18f7ebec9beb

  • Workflow run: 28744364199

  • Workflow attempt: 1

Coverage evidence

Coverage Evidence

  • Head SHA: 337a8bfc5039402da4d73687065d18f7ebec9beb
  • Required test evidence: supported repository test suites must pass.
  • Required docstring evidence: repository-owned docstring gates must pass when configured; otherwise docstring coverage is advisory.

Python project dependencies (.)

Using CPython 3.12.3 interpreter at: /usr/bin/python3
Creating virtual environment at: .venv
Resolved 17 packages in 117ms
Downloading pygments (1.2MiB)
 Downloaded pygments
Prepared 13 packages in 89ms
Installed 13 packages in 11ms
 + attrs==26.1.0
 + click==8.4.2
 + colorama==0.4.6
 + coverage==7.15.0
 + iniconfig==2.3.0
 + interrogate==1.7.0
 + packaging==26.2
 + pluggy==1.6.0
 + py==1.11.0
 + pygments==2.20.0
 + pytest==9.1.1
 + pytest-cov==7.1.0
 + tabulate==0.10.0
  • Result: PASS

Python coverage with missing-line report (.)

============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-9.1.1, pluggy-1.6.0
rootdir: /home/runner/work/.github/.github/pr-head
configfile: pyproject.toml
plugins: cov-7.1.0
collected 166 items

tests/test_assert_opencode_reasoning_effort.py ........                  [  4%]
tests/test_codeql_pr_workflow_contract.py .                              [  5%]
tests/test_noema_review_gate.py .......F...                              [ 12%]
tests/test_opencode_agent_contract.py .............                      [ 19%]
tests/test_opencode_review_normalize_output.py ......................... [ 34%]
                                                                         [ 34%]
tests/test_opencode_workflow_shell_syntax.py .                           [ 35%]
tests/test_pr_governance_audit_contract.py ...                           [ 37%]
tests/test_pr_review_fix_scheduler.py ...................                [ 48%]
tests/test_pr_review_fix_scheduler_coverage.py ..                        [ 50%]
tests/test_pr_review_merge_scheduler.py ................................ [ 69%]
..............................                                           [ 87%]
tests/test_render_opencode_prompt_template.py ....                       [ 89%]
tests/test_review_execution_contracts.py ..                              [ 90%]
tests/test_sandboxed_verify.py .........                                 [ 96%]
tests/test_sandboxed_web_e2e.py ......                                   [100%]

=================================== FAILURES ===================================
_______________ test_call_llm_handles_configuration_and_verdicts _______________

monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fcf837b0c80>

    def test_call_llm_handles_configuration_and_verdicts(monkeypatch):
        pr = make_pr()
        monkeypatch.delenv("NOEMA_LLM_API_URL", raising=False)
        monkeypatch.delenv("NOEMA_LLM_API_KEY", raising=False)
        assert noema.call_llm("owner/repo", 1, pr, "diff", False) is None
    
        monkeypatch.setenv("NOEMA_LLM_API_URL", "file:///etc/passwd")
        monkeypatch.setenv("NOEMA_LLM_API_KEY", "secret")
>       with pytest.raises(ValueError, match="must start with http:// or https://"):
E       AssertionError: Regex pattern did not match.
E         Expected regex: 'must start with http:// or https://'
E         Actual message: 'URL scheme must be http or https'

tests/test_noema_review_gate.py:209: AssertionError
----------------------------- Captured stdout call -----------------------------
Noema LLM review unavailable: NOEMA_LLM_API_URL or NOEMA_LLM_API_KEY is not configured.
=============================== warnings summary ===============================
tests/test_assert_opencode_reasoning_effort.py::test_module_entrypoint_success
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.assert_opencode_reasoning_effort' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.assert_opencode_reasoning_effort'; this may result in unpredictable behaviour

tests/test_render_opencode_prompt_template.py::test_module_entrypoint
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.render_opencode_prompt_template' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.render_opencode_prompt_template'; this may result in unpredictable behaviour

tests/test_review_execution_contracts.py::test_discovers_package_managers_java_r_json_and_main
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.review_execution_contracts' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.review_execution_contracts'; this may result in unpredictable behaviour

tests/test_sandboxed_verify.py::test_module_main_entrypoint
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.sandboxed_verify' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.sandboxed_verify'; this may result in unpredictable behaviour

tests/test_sandboxed_web_e2e.py::test_module_import_and_main_entrypoint
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.sandboxed_web_e2e' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.sandboxed_web_e2e'; this may result in unpredictable behaviour

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/test_noema_review_gate.py::test_call_llm_handles_configuration_and_verdicts - AssertionError: Regex pattern did not match.
  Expected regex: 'must start with http:// or https://'
  Actual message: 'URL scheme must be http or https'
================== 1 failed, 165 passed, 5 warnings in 5.71s ===================
  • Result: FAIL (exit 1)

Python docstring coverage advisory

RESULT: PASSED (minimum: 100.0%, actual: 100.0%)
  • Result: PASS

Coverage Decision

  • Result: FAIL
  • Test evidence: not proven passing
  • Docstring evidence: not proven passing when configured
  • Failure count: 1

Changed-File Evidence Map

flowchart LR
  PR["PR changed files"] --> Evidence["OpenCode bounded evidence"]
  Evidence --> S1["Workflow: opencode-review.yml"]
  S1 --> I1["GitHub Actions review job"]
  I1 --> R1["Review risk: Workflow: opencode-review.yml"]
  R1 --> V1["actionlint plus required checks"]
  Evidence --> S2["Test: test_opencode_agent_contract.py"]
  S2 --> I2["regression suite"]
  I2 --> R2["Review risk: Test: test_opencode_agent_contract.py"]
  R2 --> V2["targeted test run"]
Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant