Skip to content

fix(review): cycle to next model when output fails publish validation#315

Merged
seonghobae merged 1 commit into
mainfrom
fix/review-pool-cycle-on-invalid
Jul 5, 2026
Merged

fix(review): cycle to next model when output fails publish validation#315
seonghobae merged 1 commit into
mainfrom
fix/review-pool-cycle-on-invalid

Conversation

@seonghobae

Copy link
Copy Markdown
Contributor

문제 (순환 버그)

모델 풀은 normalize_output.py원본에 돌려 성공 판정하지만, publish 스텝은 ANSI 제거 후 같은 검증을 돌립니다. 원본은 통과하나 ANSI 제거본은 실패하는 모델(예: deepseek-v3)이 풀의 "성공"으로 기록돼 다음 모델로 순환하지 않고 종료 → publish가 "Selected successful OpenCode output did not include a valid control conclusion"로 리뷰 전체 실패. (모델을 쿼터순 재배치한 뒤 표면화됨.)

수정

normalize_opencode_output가 검증 전에 먼저 ANSI를 제거(publish와 동일 regex)하도록 정렬. 이제 풀은 publish가 받아들일 출력만 성공으로 기록하고, 아니면 다음 모델(o4-mini 등 형식 안정 모델)로 폴백합니다.

범위: 순환 버그만 수정. 모델 순서·타임아웃·ATTEMPTS 등은 미변경.

테스트

shell-syntax + normalize 테스트 26개 통과.

🤖 Generated with Claude Code

…validation

The model pool ran opencode_review_normalize_output.py on the raw candidate
output, but the publish step ANSI-strips the output before running the same
normalizer. A model (e.g. deepseek-v3) whose raw output passed but whose
ANSI-stripped form failed was recorded as the pool's "success", so the pool
stopped instead of trying the next model — and the publish step then failed the
whole review with "Selected successful OpenCode output did not include a valid
control conclusion" (observed after models were reordered by quota).

Strip terminal escapes in normalize_opencode_output before validating, matching
the publish step exactly, so the pool only records success for output the
publish step will accept and otherwise falls through to the next model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RTAMs4bpSZS77Xe3RQjv9P
@seonghobae seonghobae merged commit 80bf91e into main Jul 5, 2026
12 checks passed
@opencode-agent

opencode-agent Bot commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

OpenCode Review Overview

  • Head SHA: 58e66f703eb51a763f7a4e475abf1f3ea48b27c9
  • Workflow run: 28740596616
  • Workflow attempt: 1
  • Gate result: APPROVE (approval step)

Pull request overview

OpenCode reviewed the current-head bounded evidence and found no blocking issues.

Findings

No blocking findings.

Summary

Approval sufficiency: bounded evidence supplied affirmative approval evidence for changed files, coverage/docstring posture, risk surfaces, and current-head verification; approval is not based merely on the absence of known blockers.
Verification posture: CodeGraph evidence was initialized and bounded current-head evidence reviewed for changed-file evidence including scripts/ci/run_opencode_review_model_pool.sh.
Linter/static: workflow/static review evidence is bounded by the current-head GitHub Checks gate and changed-file evidence.
TDD/regression: coverage execution evidence and focused changed hunks were reviewed from bounded-review-evidence.md.
Coverage: coverage execution evidence reports test coverage as not applicable because no supported changed source files or package manifests were found.
Docstring coverage: coverage execution evidence reports docstring coverage as not applicable because no supported changed source files or package manifests were found.
DAG: CodeGraph/source-backed behavior map connects scripts/ci/run_opencode_review_model_pool.sh to the affected review, runtime, or workflow path and required checks.
PoC/execution: coverage-evidence job executed on the current head and reported PASS.
DDD/domain: workflow and repository-governance invariants were reviewed against changed files in bounded evidence.
CDD/context: CodeGraph evidence, changed-file history, and focused hunks were reviewed from bounded-review-evidence.md.
Similar issues: changed-file history evidence was reviewed for comparable local precedents.
Claim/concept check: bounded evidence, repository source, current-head workflow evidence, and, where numeric, scientific, statistical, or literature-backed claims are affected, original-paper/formula evidence and parameter-recovery expectations were used for claims.
Standards search: standards and external-source checks are delegated to configured OpenCode web_search/Context7/DeepWiki sources when applicable; no evidence-backed standards blocker is present in bounded evidence.
Compatibility/convention: changed workflow/script conventions, object naming, and reserved-word safety for schema/API/config/code surfaces were checked in bounded evidence.
Breaking-change/backcompat: deployment evidence and changed-file history were checked for backward-compatibility risk.
Performance: changed surfaces were checked for performance risk in bounded evidence.
Developer experience: changed automation, review, test, setup, and maintenance surfaces were checked for helpful or obstructive DX impact in bounded evidence.
User experience: connected user, operator, API, CLI, documentation, review-comment, status-check, rendering, and workflow-reader behavior was checked for contradictions against code, docs, and tests in bounded evidence.
Visual/DOM: Playwright visual, DOM locator, ARIA snapshot, console, and responsive evidence were checked when a web UI surface was present; for non-web surfaces, API/CLI/log/docs/workflow interaction evidence was reviewed instead.
Accessibility/i18n: accessibility, localization, and human-readable text surfaces were checked where UI, CLI, API message, docs, logs, or review text changed.
Supply-chain/license: dependency, package, model, container, and external-tool changes were checked in bounded evidence.
Packaging: package, build, test, lint, and security contracts were checked in bounded evidence.
Security/privacy: workflow-token, review-gate, and repository-automation security/privacy boundaries were checked in bounded evidence.

  • Result: APPROVE
  • Reason: The PR addresses a critical issue where the model pool could incorrectly mark a model as successful when its ANSI-stripped output fails validation, preventing the pool from cycling to the next model. The fix ensures consistency between the pool's validation and the publish step's validation.
  • Head SHA: 58e66f703eb51a763f7a4e475abf1f3ea48b27c9
  • Workflow run: 28740596616
  • Workflow attempt: 1

Changed-File Evidence Map

flowchart LR
  PR["PR changed files"] --> Evidence["OpenCode bounded evidence"]
  Evidence --> S1["CI script: run_opencode_review_model_pool.sh"]
  S1 --> I1["review and security gate shell path"]
  I1 --> R1["Review risk: CI script: run_opencode_review_model_pool.sh"]
  R1 --> V1["bash -n plus Strix self-test"]
Loading

@opencode-agent opencode-agent Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

OpenCode reviewed the current-head bounded evidence and found no blocking issues.

Findings

No blocking findings.

Summary

Approval sufficiency: bounded evidence supplied affirmative approval evidence for changed files, coverage/docstring posture, risk surfaces, and current-head verification; approval is not based merely on the absence of known blockers.
Verification posture: CodeGraph evidence was initialized and bounded current-head evidence reviewed for changed-file evidence including scripts/ci/run_opencode_review_model_pool.sh.
Linter/static: workflow/static review evidence is bounded by the current-head GitHub Checks gate and changed-file evidence.
TDD/regression: coverage execution evidence and focused changed hunks were reviewed from bounded-review-evidence.md.
Coverage: coverage execution evidence reports test coverage as not applicable because no supported changed source files or package manifests were found.
Docstring coverage: coverage execution evidence reports docstring coverage as not applicable because no supported changed source files or package manifests were found.
DAG: CodeGraph/source-backed behavior map connects scripts/ci/run_opencode_review_model_pool.sh to the affected review, runtime, or workflow path and required checks.
PoC/execution: coverage-evidence job executed on the current head and reported PASS.
DDD/domain: workflow and repository-governance invariants were reviewed against changed files in bounded evidence.
CDD/context: CodeGraph evidence, changed-file history, and focused hunks were reviewed from bounded-review-evidence.md.
Similar issues: changed-file history evidence was reviewed for comparable local precedents.
Claim/concept check: bounded evidence, repository source, current-head workflow evidence, and, where numeric, scientific, statistical, or literature-backed claims are affected, original-paper/formula evidence and parameter-recovery expectations were used for claims.
Standards search: standards and external-source checks are delegated to configured OpenCode web_search/Context7/DeepWiki sources when applicable; no evidence-backed standards blocker is present in bounded evidence.
Compatibility/convention: changed workflow/script conventions, object naming, and reserved-word safety for schema/API/config/code surfaces were checked in bounded evidence.
Breaking-change/backcompat: deployment evidence and changed-file history were checked for backward-compatibility risk.
Performance: changed surfaces were checked for performance risk in bounded evidence.
Developer experience: changed automation, review, test, setup, and maintenance surfaces were checked for helpful or obstructive DX impact in bounded evidence.
User experience: connected user, operator, API, CLI, documentation, review-comment, status-check, rendering, and workflow-reader behavior was checked for contradictions against code, docs, and tests in bounded evidence.
Visual/DOM: Playwright visual, DOM locator, ARIA snapshot, console, and responsive evidence were checked when a web UI surface was present; for non-web surfaces, API/CLI/log/docs/workflow interaction evidence was reviewed instead.
Accessibility/i18n: accessibility, localization, and human-readable text surfaces were checked where UI, CLI, API message, docs, logs, or review text changed.
Supply-chain/license: dependency, package, model, container, and external-tool changes were checked in bounded evidence.
Packaging: package, build, test, lint, and security contracts were checked in bounded evidence.
Security/privacy: workflow-token, review-gate, and repository-automation security/privacy boundaries were checked in bounded evidence.

  • Result: APPROVE
  • Reason: The PR addresses a critical issue where the model pool could incorrectly mark a model as successful when its ANSI-stripped output fails validation, preventing the pool from cycling to the next model. The fix ensures consistency between the pool's validation and the publish step's validation.
  • Head SHA: 58e66f703eb51a763f7a4e475abf1f3ea48b27c9
  • Workflow run: 28740596616
  • Workflow attempt: 1

Changed-File Evidence Map

flowchart LR
  PR["PR changed files"] --> Evidence["OpenCode bounded evidence"]
  Evidence --> S1["CI script: run_opencode_review_model_pool.sh"]
  S1 --> I1["review and security gate shell path"]
  I1 --> R1["Review risk: CI script: run_opencode_review_model_pool.sh"]
  R1 --> V1["bash -n plus Strix self-test"]
Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant