fix(review): put contract-reliable reasoning models first, flagships last by seonghobae · Pull Request #318 · ContextualWisdomLab/.github

seonghobae · 2026-07-05T14:41:34Z

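Background

Ordering the pool by quota (deepseek-v3 first) surfaced a problem: high-quota non-reasoning models do not honor the review contract. They first emitted a bare APPROVE without the required labels (repaired in #317), then hallucinated REQUEST_CHANGES findings that are not source-backed. The Approve PR gate rejects those findings, so the review fails without rotating to the next model.

Fix

Put the mini reasoning models that reliably honor the contract (o4-mini, o3-mini, gpt-5-mini/nano/chat) first. Keep deepseek/mistral/llama as fallbacks, and place the lowest-quota flagships (o3, gpt-5) last; putting gpt-5 first caused a 6-hour stall. This matches the verified lead model from before #295 (o4-mini).

Settings (ATTEMPTS/RUN_TIMEOUT/step timeout) are unchanged. Guard tests updated.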
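
The ordering described above can be sketched as a tiered list flattened into one attempt order. This is a minimal illustration, not the workflow's actual configuration: the tier names and the `build_review_order` helper are hypothetical, the shorthand gpt-5-mini/nano/chat is expanded to three identifiers by assumption, and the fallback identifiers are written as they appear in this PR's text.

```python
# Hypothetical tiers for illustration; the real pool is defined in the workflow.
TIERS = {
    # Mini reasoning models that reliably honor the review contract.
    "mini_reasoning": ["o4-mini", "o3-mini", "gpt-5-mini", "gpt-5-nano", "gpt-5-chat"],
    # High-quota non-reasoning models kept as fallbacks.
    "fallback": ["deepseek-v3", "mistral", "llama-4"],
    # Lowest-quota flagships, tried last.
    "flagship": ["o3", "gpt-5"],
}
TIER_ORDER = ["mini_reasoning", "fallback", "flagship"]


def build_review_order(tiers, tier_order):
    """Flatten tiers into one attempt order, keeping each tier's internal order."""
    order = []
    for name in tier_order:
        for model in tiers[name]:
            if model not in order:  # skip accidental duplicates across tiers
                order.append(model)
    return order


REVIEW_ORDER = build_review_order(TIERS, TIER_ORDER)
print(REVIEW_ORDER)
```

A guard test in this style would assert that the first entry is o4-mini and that o3 and gpt-5 come after every fallback model.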

🤖 Generated with Claude Code

…last

Ordering the highest-quota non-reasoning models first (deepseek-v3, mistral, llama-4) surfaced that they emit reviews the publish/approve gates reject: bare APPROVE summaries missing required labels, then hallucinated non-source-backed REQUEST_CHANGES findings. The pool records them as success instead of falling through.

The mini reasoning models (o4-mini, o3-mini, gpt-5-mini/nano/chat) reliably emit the strict contract and have adequate quota, so lead with them; keep deepseek/mistral/llama as fallbacks and the rate-starved flagships o3/gpt-5 last (first-placing gpt-5 stalled every review to timeout). Matches the pre-#295 known-good lead model (o4-mini).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RTAMs4bpSZS77Xe3RQjv9P
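
This PR handles the problem by reordering the pool, not by changing how the pool records success. For readers unfamiliar with the failure mode, the sketch below shows what "falling through" on a contract violation would look like. Everything in it is an assumption made for illustration: the `REQUIRED_LABELS` set, the `satisfies_contract` check, and the `run_review` callable are hypothetical and do not reflect the workflow's real contract or code.

```python
from typing import Callable, Iterable, Optional, Tuple

# Assumed label set, for illustration only; the real contract lives in the workflow.
REQUIRED_LABELS = ("Result:", "Reason:")


def satisfies_contract(review: str) -> bool:
    """Return True only if every assumed required label appears in the review text."""
    return all(label in review for label in REQUIRED_LABELS)


def first_contract_valid_review(
    models: Iterable[str],
    run_review: Callable[[str], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Try models in order; accept the first output that satisfies the contract.

    A model that returns nothing, or returns text that violates the contract,
    is treated as a failed attempt and the next model is tried.
    """
    for model in models:
        review = run_review(model)
        if review is not None and satisfies_contract(review):
            return model, review
    return None


def fake_run_review(model: str) -> Optional[str]:
    """Hypothetical stand-in for a model call."""
    if model == "deepseek-v3":
        return "APPROVE"  # bare verdict, missing the assumed labels
    return "Result: APPROVE\nReason: no blocking findings"


print(first_contract_valid_review(["deepseek-v3", "o4-mini"], fake_run_review))
```

With the stand-in above, the bare APPROVE from the first model is rejected and the second model's labeled output is accepted.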

github-actions

Pull request overview

OpenCode cannot approve yet because required coverage evidence did not pass.

Review outcome

1. HIGH .github/workflows/opencode-review.yml:1 - Coverage evidence did not prove required test/docstring evidence

Problem: The required coverage-evidence job result was failure, so OpenCode cannot establish approval sufficiency for this head.
Root cause: Automated approval is only valid when the same-head coverage-evidence job proves supported repository test suites passed and configured docstring gates passed or were advisory, or reports not applicable because no supported source files or package manifests exist. Missing, failed, skipped, unavailable, or unsupported-tooling test evidence is a blocker.
Fix: Install or configure the repository test/docstring evidence tooling when source files or package manifests exist, rerun the current-head coverage-evidence job, and approve only after it reports success with required evidence or explicit no-source not-applicable evidence.
Regression test: Keep the approval branch checking needs.coverage-evidence.result == success before posting APPROVE, and publish REQUEST_CHANGES when coverage-evidence blocker states such as cancelled, skipped, failed, unsupported-tooling, or below-100 evidence are present.
Result: REQUEST_CHANGES
Reason: coverage-evidence result was failure, so required test/docstring evidence was not proven for current head 337a8bfc5039402da4d73687065d18f7ebec9beb.
Head SHA: 337a8bfc5039402da4d73687065d18f7ebec9beb
Workflow run: 28744364199
Workflow attempt: 1
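
The approval rule stated in this finding can be sketched as a small decision function. This is an illustration under stated assumptions, not the workflow's implementation: `gate_event` and its parameters are hypothetical names, and the coverage result is modeled as the plain string a job result would carry (`success`, `failure`, `cancelled`, `skipped`).

```python
def gate_event(model_verdict: str, coverage_result: str) -> str:
    """Map a model verdict and the coverage-evidence job result to a review event.

    Any coverage result other than "success" blocks approval, regardless of
    what the model said.
    """
    if coverage_result != "success":
        return "REQUEST_CHANGES"
    if model_verdict == "APPROVE":
        return "APPROVE"
    return "REQUEST_CHANGES"


# The situation in this run: the coverage-evidence job failed.
print(gate_event("APPROVE", "failure"))
```

Checking for equality with `success`, instead of enumerating blocker states, means that a missing or unexpected result also blocks approval.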

Coverage evidence

Coverage Evidence

Head SHA: 337a8bfc5039402da4d73687065d18f7ebec9beb
Required test evidence: supported repository test suites must pass.
Required docstring evidence: repository-owned docstring gates must pass when configured; otherwise docstring coverage is advisory.

Python project dependencies (.)

Using CPython 3.12.3 interpreter at: /usr/bin/python3
Creating virtual environment at: .venv
Resolved 17 packages in 117ms
Downloading pygments (1.2MiB)
 Downloaded pygments
Prepared 13 packages in 89ms
Installed 13 packages in 11ms
 + attrs==26.1.0
 + click==8.4.2
 + colorama==0.4.6
 + coverage==7.15.0
 + iniconfig==2.3.0
 + interrogate==1.7.0
 + packaging==26.2
 + pluggy==1.6.0
 + py==1.11.0
 + pygments==2.20.0
 + pytest==9.1.1
 + pytest-cov==7.1.0
 + tabulate==0.10.0

Result: PASS

Python coverage with missing-line report (.)

============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-9.1.1, pluggy-1.6.0
rootdir: /home/runner/work/.github/.github/pr-head
configfile: pyproject.toml
plugins: cov-7.1.0
collected 166 items

tests/test_assert_opencode_reasoning_effort.py ........                  [  4%]
tests/test_codeql_pr_workflow_contract.py .                              [  5%]
tests/test_noema_review_gate.py .......F...                              [ 12%]
tests/test_opencode_agent_contract.py .............                      [ 19%]
tests/test_opencode_review_normalize_output.py ......................... [ 34%]
                                                                         [ 34%]
tests/test_opencode_workflow_shell_syntax.py .                           [ 35%]
tests/test_pr_governance_audit_contract.py ...                           [ 37%]
tests/test_pr_review_fix_scheduler.py ...................                [ 48%]
tests/test_pr_review_fix_scheduler_coverage.py ..                        [ 50%]
tests/test_pr_review_merge_scheduler.py ................................ [ 69%]
..............................                                           [ 87%]
tests/test_render_opencode_prompt_template.py ....                       [ 89%]
tests/test_review_execution_contracts.py ..                              [ 90%]
tests/test_sandboxed_verify.py .........                                 [ 96%]
tests/test_sandboxed_web_e2e.py ......                                   [100%]

=================================== FAILURES ===================================
_______________ test_call_llm_handles_configuration_and_verdicts _______________

monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fcf837b0c80>

    def test_call_llm_handles_configuration_and_verdicts(monkeypatch):
        pr = make_pr()
        monkeypatch.delenv("NOEMA_LLM_API_URL", raising=False)
        monkeypatch.delenv("NOEMA_LLM_API_KEY", raising=False)
        assert noema.call_llm("owner/repo", 1, pr, "diff", False) is None
    
        monkeypatch.setenv("NOEMA_LLM_API_URL", "file:///etc/passwd")
        monkeypatch.setenv("NOEMA_LLM_API_KEY", "secret")
>       with pytest.raises(ValueError, match="must start with http:// or https://"):
E       AssertionError: Regex pattern did not match.
E         Expected regex: 'must start with http:// or https://'
E         Actual message: 'URL scheme must be http or https'

tests/test_noema_review_gate.py:209: AssertionError
----------------------------- Captured stdout call -----------------------------
Noema LLM review unavailable: NOEMA_LLM_API_URL or NOEMA_LLM_API_KEY is not configured.
=============================== warnings summary ===============================
tests/test_assert_opencode_reasoning_effort.py::test_module_entrypoint_success
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.assert_opencode_reasoning_effort' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.assert_opencode_reasoning_effort'; this may result in unpredictable behaviour

tests/test_render_opencode_prompt_template.py::test_module_entrypoint
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.render_opencode_prompt_template' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.render_opencode_prompt_template'; this may result in unpredictable behaviour

tests/test_review_execution_contracts.py::test_discovers_package_managers_java_r_json_and_main
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.review_execution_contracts' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.review_execution_contracts'; this may result in unpredictable behaviour

tests/test_sandboxed_verify.py::test_module_main_entrypoint
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.sandboxed_verify' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.sandboxed_verify'; this may result in unpredictable behaviour

tests/test_sandboxed_web_e2e.py::test_module_import_and_main_entrypoint
  <frozen runpy>:128: RuntimeWarning: 'scripts.ci.sandboxed_web_e2e' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.sandboxed_web_e2e'; this may result in unpredictable behaviour

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/test_noema_review_gate.py::test_call_llm_handles_configuration_and_verdicts - AssertionError: Regex pattern did not match.
  Expected regex: 'must start with http:// or https://'
  Actual message: 'URL scheme must be http or https'
================== 1 failed, 165 passed, 5 warnings in 5.71s ===================

Result: FAIL (exit 1)
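
The single failure above is a message mismatch: the test expects one wording and the code raises another. The sketch below reproduces that mismatch in isolation. `validate_llm_url` is a hypothetical stand-in for whatever validation `noema.call_llm` performs (its real implementation is not shown in this log), and `message_matches` mimics how `pytest.raises(..., match=...)` applies `re.search` to the exception message.

```python
import re
from urllib.parse import urlparse


def validate_llm_url(url: str) -> str:
    """Hypothetical stand-in: reject any URL whose scheme is not http or https."""
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https"):
        # Same wording as the "Actual message" in the log above.
        raise ValueError("URL scheme must be http or https")
    return url


def message_matches(pattern: str, url: str) -> bool:
    """Return True if validation raises ValueError and the pattern matches its message."""
    try:
        validate_llm_url(url)
    except ValueError as exc:
        return re.search(pattern, str(exc)) is not None
    return False


old_pattern = "must start with http:// or https://"  # what the test expects
new_pattern = "URL scheme must be http or https"     # what the code raises

print(message_matches(old_pattern, "file:///etc/passwd"))  # False
print(message_matches(new_pattern, "file:///etc/passwd"))  # True
```

Aligning either side would clear the failure: update the test's `match` pattern to the new wording, or restore the old message in the code.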

Python docstring coverage advisory

RESULT: PASSED (minimum: 100.0%, actual: 100.0%)

Result: PASS

Coverage Decision

Result: FAIL
Test evidence: not proven passing
Docstring evidence: not proven passing when configured
Failure count: 1

Changed-File Evidence Map

flowchart LR
  PR["PR changed files"] --> Evidence["OpenCode bounded evidence"]
  Evidence --> S1["Workflow: opencode-review.yml"]
  S1 --> I1["GitHub Actions review job"]
  I1 --> R1["Review risk: Workflow: opencode-review.yml"]
  R1 --> V1["actionlint plus required checks"]
  Evidence --> S2["Test: test_opencode_agent_contract.py"]
  S2 --> I2["regression suite"]
  I2 --> R2["Review risk: Test: test_opencode_agent_contract.py"]
  R2 --> V2["targeted test run"]

github-actions · 2026-07-05T14:48:56Z

OpenCode Review Overview

Head SHA: 337a8bfc5039402da4d73687065d18f7ebec9beb
Workflow run: 28744364199
Workflow attempt: 1
Gate result: REQUEST_CHANGES (approval step)

seonghobae merged commit 3180040 into main Jul 5, 2026
11 of 12 checks passed

github-actions Bot requested changes Jul 5, 2026

seonghobae merged 1 commit into main from fix/review-order-reasoning-first
