fix(review): put contract-reliable reasoning models first, flagships last #318
Ordering the highest-quota non-reasoning models first (deepseek-v3, mistral, llama-4) surfaced that they emit reviews the publish/approve gates reject: first bare APPROVE summaries missing required labels, then hallucinated, non-source-backed REQUEST_CHANGES findings. The pool records these as success instead of falling through.

The mini reasoning models (o4-mini, o3-mini, gpt-5-mini/nano/chat) reliably emit the strict contract and have adequate quota, so lead with them. Keep deepseek/mistral/llama as fallbacks and the rate-starved flagships o3/gpt-5 last (placing gpt-5 first stalled every review to timeout). This matches the pre-#295 known-good lead model (o4-mini).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RTAMs4bpSZS77Xe3RQjv9P
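The fall-through behaviour the commit message asks for can be sketched roughly as follows. This is a minimal illustration, not the repository's pool implementation: `REQUIRED_LABELS`, `satisfies_contract`, `first_valid_review`, and `fake_run` are hypothetical names, and the label strings are assumed for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

# Assumed label names; the real required labels are defined by the review
# contract and are not shown in this PR text.
REQUIRED_LABELS = ("Result:", "Reason:")


def satisfies_contract(review: Optional[str]) -> bool:
    """Return True only if the review text carries every required label."""
    return review is not None and all(label in review for label in REQUIRED_LABELS)


def first_valid_review(
    models: Iterable[str],
    run_model: Callable[[str], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Try models in order and skip any whose output fails the contract."""
    for model in models:
        review = run_model(model)
        if satisfies_contract(review):
            return model, review
    return None  # every model failed; the caller treats this as a failed review


def fake_run(model: str) -> Optional[str]:
    """Stand-in for a real model call, used only to exercise the loop."""
    canned = {
        "deepseek-v3": "APPROVE",  # bare summary without labels
        "o4-mini": "Result: APPROVE\nReason: no blocking findings",
    }
    return canned.get(model)
```

With this shape, a bare `APPROVE` from an earlier model is not recorded as success; the loop moves on to the next model in the list.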
Pull request overview
OpenCode cannot approve yet because required coverage evidence did not pass.
Review outcome
1. HIGH .github/workflows/opencode-review.yml:1 - Coverage evidence did not prove required test/docstring evidence
- Problem: The required coverage-evidence job result was `failure`, so OpenCode cannot establish approval sufficiency for this head.
- Root cause: Automated approval is only valid when the same-head coverage-evidence job proves supported repository test suites passed and configured docstring gates passed or were advisory, or reports not applicable because no supported source files or package manifests exist. Missing, failed, skipped, unavailable, or unsupported-tooling test evidence is a blocker.
- Fix: Install or configure the repository test/docstring evidence tooling when source files or package manifests exist, rerun the current-head coverage-evidence job, and approve only after it reports `success` with required evidence or explicit no-source not-applicable evidence.
- Regression test: Keep the approval branch checking `needs.coverage-evidence.result == success` before posting APPROVE, and publish REQUEST_CHANGES when coverage-evidence blocker states such as cancelled, skipped, failed, unsupported-tooling, or below-100 evidence are present.
- Result: REQUEST_CHANGES
- Reason: coverage-evidence result was `failure`, so required test/docstring evidence was not proven for current head 337a8bfc5039402da4d73687065d18f7ebec9beb.
- Head SHA: 337a8bfc5039402da4d73687065d18f7ebec9beb
- Workflow run: 28744364199
- Workflow attempt: 1
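The coverage gate described in the finding reduces to a single comparison. The sketch below assumes a hypothetical helper, `coverage_gate_verdict`, that receives the job result string; it covers only the coverage-evidence condition, not any other approval requirement.

```python
def coverage_gate_verdict(coverage_result: str) -> str:
    """Map a coverage-evidence job result to a review verdict.

    Only an exact "success" allows APPROVE. Any other state, such as
    failure, cancelled, or skipped, is treated as a blocker.
    """
    return "APPROVE" if coverage_result == "success" else "REQUEST_CHANGES"
```

Comparing against the one accepted value, instead of listing the blocker states, means an unexpected or empty result also falls on the REQUEST_CHANGES side.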
Coverage evidence
- Head SHA: 337a8bfc5039402da4d73687065d18f7ebec9beb
- Required test evidence: supported repository test suites must pass.
- Required docstring evidence: repository-owned docstring gates must pass when configured; otherwise docstring coverage is advisory.
Python project dependencies (.)
Using CPython 3.12.3 interpreter at: /usr/bin/python3
Creating virtual environment at: .venv
Resolved 17 packages in 117ms
Downloading pygments (1.2MiB)
Downloaded pygments
Prepared 13 packages in 89ms
Installed 13 packages in 11ms
+ attrs==26.1.0
+ click==8.4.2
+ colorama==0.4.6
+ coverage==7.15.0
+ iniconfig==2.3.0
+ interrogate==1.7.0
+ packaging==26.2
+ pluggy==1.6.0
+ py==1.11.0
+ pygments==2.20.0
+ pytest==9.1.1
+ pytest-cov==7.1.0
+ tabulate==0.10.0
- Result: PASS
Python coverage with missing-line report (.)
============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-9.1.1, pluggy-1.6.0
rootdir: /home/runner/work/.github/.github/pr-head
configfile: pyproject.toml
plugins: cov-7.1.0
collected 166 items
tests/test_assert_opencode_reasoning_effort.py ........ [ 4%]
tests/test_codeql_pr_workflow_contract.py . [ 5%]
tests/test_noema_review_gate.py .......F... [ 12%]
tests/test_opencode_agent_contract.py ............. [ 19%]
tests/test_opencode_review_normalize_output.py ......................... [ 34%]
[ 34%]
tests/test_opencode_workflow_shell_syntax.py . [ 35%]
tests/test_pr_governance_audit_contract.py ... [ 37%]
tests/test_pr_review_fix_scheduler.py ................... [ 48%]
tests/test_pr_review_fix_scheduler_coverage.py .. [ 50%]
tests/test_pr_review_merge_scheduler.py ................................ [ 69%]
.............................. [ 87%]
tests/test_render_opencode_prompt_template.py .... [ 89%]
tests/test_review_execution_contracts.py .. [ 90%]
tests/test_sandboxed_verify.py ......... [ 96%]
tests/test_sandboxed_web_e2e.py ...... [100%]
=================================== FAILURES ===================================
_______________ test_call_llm_handles_configuration_and_verdicts _______________
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fcf837b0c80>
    def test_call_llm_handles_configuration_and_verdicts(monkeypatch):
        pr = make_pr()
        monkeypatch.delenv("NOEMA_LLM_API_URL", raising=False)
        monkeypatch.delenv("NOEMA_LLM_API_KEY", raising=False)
        assert noema.call_llm("owner/repo", 1, pr, "diff", False) is None
        monkeypatch.setenv("NOEMA_LLM_API_URL", "file:///etc/passwd")
        monkeypatch.setenv("NOEMA_LLM_API_KEY", "secret")
>       with pytest.raises(ValueError, match="must start with http:// or https://"):
E       AssertionError: Regex pattern did not match.
E        Expected regex: 'must start with http:// or https://'
E        Actual message: 'URL scheme must be http or https'
tests/test_noema_review_gate.py:209: AssertionError
----------------------------- Captured stdout call -----------------------------
Noema LLM review unavailable: NOEMA_LLM_API_URL or NOEMA_LLM_API_KEY is not configured.
=============================== warnings summary ===============================
tests/test_assert_opencode_reasoning_effort.py::test_module_entrypoint_success
<frozen runpy>:128: RuntimeWarning: 'scripts.ci.assert_opencode_reasoning_effort' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.assert_opencode_reasoning_effort'; this may result in unpredictable behaviour
tests/test_render_opencode_prompt_template.py::test_module_entrypoint
<frozen runpy>:128: RuntimeWarning: 'scripts.ci.render_opencode_prompt_template' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.render_opencode_prompt_template'; this may result in unpredictable behaviour
tests/test_review_execution_contracts.py::test_discovers_package_managers_java_r_json_and_main
<frozen runpy>:128: RuntimeWarning: 'scripts.ci.review_execution_contracts' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.review_execution_contracts'; this may result in unpredictable behaviour
tests/test_sandboxed_verify.py::test_module_main_entrypoint
<frozen runpy>:128: RuntimeWarning: 'scripts.ci.sandboxed_verify' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.sandboxed_verify'; this may result in unpredictable behaviour
tests/test_sandboxed_web_e2e.py::test_module_import_and_main_entrypoint
<frozen runpy>:128: RuntimeWarning: 'scripts.ci.sandboxed_web_e2e' found in sys.modules after import of package 'scripts.ci', but prior to execution of 'scripts.ci.sandboxed_web_e2e'; this may result in unpredictable behaviour
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/test_noema_review_gate.py::test_call_llm_handles_configuration_and_verdicts - AssertionError: Regex pattern did not match.
Expected regex: 'must start with http:// or https://'
Actual message: 'URL scheme must be http or https'
================== 1 failed, 165 passed, 5 warnings in 5.71s ===================
- Result: FAIL (exit 1)
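The single failure above is a message mismatch, not a behaviour change: the `ValueError` is still raised for a `file://` URL, but its text no longer matches the pattern the test expects. A minimal reproduction, assuming a hypothetical `validate_api_url` in place of the real check inside `noema.call_llm` (whose code is not shown here), could look like this. `pytest.raises(match=...)` applies `re.search` to the exception message, which `message_matches` imitates.

```python
import re
from urllib.parse import urlparse

OLD_PATTERN = "must start with http:// or https://"  # pattern in the failing test
NEW_PATTERN = "URL scheme must be http or https"  # message reported in the log


def validate_api_url(url: str) -> str:
    """Hypothetical stand-in: reject any scheme other than http or https."""
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError("URL scheme must be http or https")
    return url


def message_matches(pattern: str, url: str) -> bool:
    """Return True if validating url raises ValueError whose text matches pattern."""
    try:
        validate_api_url(url)
    except ValueError as exc:
        return re.search(pattern, str(exc)) is not None
    return False
```

Under these assumptions the old pattern does not match and the new one does, so either the test pattern or the error message would need to change for the two to agree.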
Python docstring coverage advisory
RESULT: PASSED (minimum: 100.0%, actual: 100.0%)
- Result: PASS
Coverage Decision
- Result: FAIL
- Test evidence: not proven passing
- Docstring evidence: not proven passing when configured
- Failure count: 1
Changed-File Evidence Map
flowchart LR
PR["PR changed files"] --> Evidence["OpenCode bounded evidence"]
Evidence --> S1["Workflow: opencode-review.yml"]
S1 --> I1["GitHub Actions review job"]
I1 --> R1["Review risk: Workflow: opencode-review.yml"]
R1 --> V1["actionlint plus required checks"]
Evidence --> S2["Test: test_opencode_agent_contract.py"]
S2 --> I2["regression suite"]
I2 --> R2["Review risk: Test: test_opencode_agent_contract.py"]
R2 --> V2["targeted test run"]
OpenCode Review Overview
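Background
Ordering by quota (deepseek-v3 first) surfaced a problem: the high-quota non-reasoning models do not honor the review contract. They emit a bare APPROVE without labels (repaired in #317), followed by hallucinated (non-source-backed) REQUEST_CHANGES findings. The Approve PR gate rejects these, so the review fails without rotating to the next model.
Fix
Put the mini reasoning models that reliably honor the contract (o4-mini, o3-mini, gpt-5-mini/nano/chat) first. deepseek/mistral/llama become fallbacks, and the lowest-quota flagships (o3, gpt-5) go last (placing gpt-5 first caused a 6-hour stall). This matches the verified lead model from before #295 (o4-mini).
Settings (ATTEMPTS/RUN_TIMEOUT/step timeout) are unchanged. The guard test is updated.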
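A guard for the tier ordering could be sketched as below. The tier sets and the names `TIERS`, `tier_index`, and `is_tier_ordered` are assumptions for illustration; the model identifiers are the short names used in this description (with gpt-5-mini/nano/chat expanded by assumption), not necessarily the exact strings in the workflow file or in the repository's guard test.

```python
# Tiers in the intended order: mini reasoning, fallbacks, flagships.
TIERS = (
    {"o4-mini", "o3-mini", "gpt-5-mini", "gpt-5-nano", "gpt-5-chat"},
    {"deepseek-v3", "mistral", "llama-4"},
    {"o3", "gpt-5"},
)


def tier_index(model: str) -> int:
    """Return the position of the tier that contains model."""
    for index, tier in enumerate(TIERS):
        if model in tier:
            return index
    raise ValueError(f"unknown model: {model}")


def is_tier_ordered(models: list) -> bool:
    """True when no model appears before a model from an earlier tier."""
    indices = [tier_index(model) for model in models]
    return indices == sorted(indices)
```

A check like this constrains only the order between tiers, so models within a tier can be rearranged without touching the guard.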
🤖 Generated with Claude Code