From bc6ca78ec639a19688044a20ba151b3191c3b072 Mon Sep 17 00:00:00 2001
From: mrizzi
Date: Fri, 29 May 2026 14:31:11 +0200
Subject: [PATCH] feat(verify-pr): add autonomous eval failure sub-task creation and root-cause integration

Add eval failure sub-task section to Step 6d that creates Jira sub-tasks
for failing eval assertions (grouped by eval ID) with idempotency checks.
Update Step 7 to include eval failure sub-tasks as inputs to root-cause
investigation alongside review feedback and CI failure sub-tasks.

Implements TC-4573

Assisted-by: Claude Code
---
 evals/verify-pr/evals.json                    |  4 +-
 .../sdlc-workflow/skills/verify-pr/SKILL.md   | 50 +++++++++++++++++--
 2 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/evals/verify-pr/evals.json b/evals/verify-pr/evals.json
index 4f518e20..c30229ab 100644
--- a/evals/verify-pr/evals.json
+++ b/evals/verify-pr/evals.json
@@ -17,6 +17,7 @@
         "The verification report assembles verdicts from all four domain sub-agents: Scope Containment, Diff Size, and Commit Traceability from Intent Alignment; Sensitive Patterns from Security; CI Status, Acceptance Criteria, and Verification Commands from Correctness; Test Quality (combining Repetitive Test Detection, Test Documentation, and Eval Quality) and Test Change Classification from Style/Conventions",
         "The report includes detailed findings with specific evidence for each domain — file-by-file scope comparison (Intent Alignment), line-level pattern scanning results (Security), per-criterion code-level verification (Correctness), and test quality assessment (Style/Conventions) — not just pass/fail verdicts in the summary table",
         "Eval Quality is N/A because no eval result reviews exist in the PR — the 3-criteria detection (author github-actions[bot], marker ## Eval Results, footer sdlc-workflow/run-evals) found no matches, so Eval Quality does not affect the Test Quality combination",
+        "No eval failure sub-tasks are created because Eval Quality is N/A — eval failure sub-tasks are only created when Eval Quality is WARN (at least one eval assertion failed)",
         "Review Feedback and Root-Cause Investigation verdicts are determined by the orchestrator independently from domain analysis — Review Feedback is N/A because no review comments exist, Root-Cause Investigation is N/A because no sub-tasks were created"
       ]
     },
@@ -58,7 +59,8 @@
         "The report contains a Test Change Classification row with ADDITIVE — only new test files were added (tests/api/sbom_delete.rs is a new file)",
         "Convention upgrade eligibility is evaluated for review comment 30002 (index suggestion) — the review classification output (review-30002.md) or the report's Style/Conventions analysis explains whether the suggestion matches a documented or demonstrated project convention",
         "Review comment 30002 (index suggestion) results in a sub-task regardless of classification path — whether classified directly as code change request based on reviewer language, or upgraded from suggestion via convention analysis",
-        "Eval Quality is N/A because no eval result reviews exist in the PR — the 3-criteria detection (author github-actions[bot], marker ## Eval Results, footer sdlc-workflow/run-evals) found no matches, so Eval Quality does not affect the Test Quality combination"
+        "Eval Quality is N/A because no eval result reviews exist in the PR — the 3-criteria detection (author github-actions[bot], marker ## Eval Results, footer sdlc-workflow/run-evals) found no matches, so Eval Quality does not affect the Test Quality combination",
+        "No eval failure sub-tasks are created — the sub-tasks created are exclusively review feedback sub-tasks (labels [\"ai-generated-jira\", \"review-feedback\"]), not eval failure sub-tasks (labels [\"ai-generated-jira\", \"eval-failure\"]), because Eval Quality is N/A"
       ]
     },
     {
diff --git a/plugins/sdlc-workflow/skills/verify-pr/SKILL.md b/plugins/sdlc-workflow/skills/verify-pr/SKILL.md
index c809007b..a6a7dd5d 100644
--- a/plugins/sdlc-workflow/skills/verify-pr/SKILL.md
+++ b/plugins/sdlc-workflow/skills/verify-pr/SKILL.md
@@ -492,6 +492,41 @@ Process `create-sub-task` actions from the Correctness sub-agent. For each actio
 3. **Create issue link:**
    jira.create_issue_link(type="Blocks", inwardIssue=, outwardIssue=)
 
+#### Eval failure sub-tasks
+
+Process eval assertion failures from the Style/Conventions sub-agent's Check 5
+(Eval Quality verdict and failing assertion details, extracted in Step 6a). Only
+proceed if Eval Quality is WARN (i.e., at least one assertion failed). If Eval
+Quality is PASS or N/A, skip this section entirely.
+
+1. **Group by eval ID:** Group failing assertions by eval ID (e.g., all failures
+   from eval-3 become one sub-task). Each eval with at least one failing assertion
+   gets one sub-task. The sub-task summary should include the eval ID and a brief
+   description of the failure category, e.g., "Fix eval-3 assertion failures:
+   convention upgrade eligibility, sub-task creation".
+
+2. **Idempotency check:** Check the parent task's existing sub-tasks (issue links)
+   for sub-tasks with labels `["ai-generated-jira", "eval-failure"]` whose summaries
+   reference the same eval ID. If a matching sub-task already exists, skip creation
+   for that eval.
+
+3. **Create sub-task:** For each failing eval, create a Jira sub-task:
+
+   jira.create_issue with:
+   - **Parent:** the current task's Jira issue ID
+   - **Summary:** "Fix assertion failures: "
+   - **Labels:** `["ai-generated-jira", "eval-failure"]`
+   - **Description:** structured task description following the template defined in
+     [`shared/task-description-template.md`](../shared/task-description-template.md).
+     Include:
+     - **Review Context** — the eval review body excerpt with specific failing
+       assertions and their evidence (text + evidence fields from the eval review)
+     - **Target PR** — the PR URL from Step 2 (so implement-task adds commits to
+       the existing branch)
+
+4. **Create issue link:**
+   jira.create_issue_link(type="Blocks", inwardIssue=, outwardIssue=)
+
 ### Step 6e – Reply to Review Comments
 
 Reply to **every** classified review comment thread with the classification label and
@@ -561,8 +596,9 @@ Record the Review Feedback check result:
 ## Step 7 – Root-Cause Investigation
 
 Investigate the root cause of each defect — whether flagged by a reviewer (Step 6d
-review feedback sub-tasks) or by a CI failure (Step 6d CI failure sub-tasks) — to
-identify systemic improvements that prevent similar mistakes in future tasks.
+review feedback sub-tasks), by a CI failure (Step 6d CI failure sub-tasks), or by
+an eval assertion failure (Step 6d eval failure sub-tasks) — to identify systemic
+improvements that prevent similar mistakes in future tasks.
 
 ### Step 7a – Sub-agent Investigation
 
@@ -570,8 +606,14 @@ If Step 6d created any sub-tasks, spawn a sub-agent to investigate the root
 cause of each defect. If no sub-tasks were created, record Root-Cause
 Investigation as N/A and skip to Step 8.
 
-The sub-agent investigates both reviewer-flagged defects and CI failures
-using the same classification and investigation process below.
+The sub-agent investigates reviewer-flagged defects, CI failures, and eval
+assertion failures using the same classification and investigation process below.
+
+Eval failure sub-tasks (from Step 6d) are included alongside review feedback
+sub-tasks and CI failure sub-tasks as inputs to root-cause investigation. The
+existing classification framework applies: eval assertion failures are typically
+universal knowledge (eval design patterns apply across repos) and classify as
+method-based skill gaps in the implement-task phase.
 
 The sub-agent receives these inputs:
 1. **Parent Feature description** — fetched by following the "incorporates" issue link