Merged
4 changes: 3 additions & 1 deletion evals/verify-pr/evals.json
@@ -17,6 +17,7 @@
"The verification report assembles verdicts from all four domain sub-agents: Scope Containment, Diff Size, and Commit Traceability from Intent Alignment; Sensitive Patterns from Security; CI Status, Acceptance Criteria, and Verification Commands from Correctness; Test Quality (combining Repetitive Test Detection, Test Documentation, and Eval Quality) and Test Change Classification from Style/Conventions",
"The report includes detailed findings with specific evidence for each domain — file-by-file scope comparison (Intent Alignment), line-level pattern scanning results (Security), per-criterion code-level verification (Correctness), and test quality assessment (Style/Conventions) — not just pass/fail verdicts in the summary table",
"Eval Quality is N/A because no eval result reviews exist in the PR — the 3-criteria detection (author github-actions[bot], marker ## Eval Results, footer sdlc-workflow/run-evals) found no matches, so Eval Quality does not affect the Test Quality combination",
+"No eval failure sub-tasks are created because Eval Quality is N/A — eval failure sub-tasks are only created when Eval Quality is WARN (at least one eval assertion failed)",
"Review Feedback and Root-Cause Investigation verdicts are determined by the orchestrator independently from domain analysis — Review Feedback is N/A because no review comments exist, Root-Cause Investigation is N/A because no sub-tasks were created"
]
},
@@ -58,7 +59,8 @@
"The report contains a Test Change Classification row with ADDITIVE — only new test files were added (tests/api/sbom_delete.rs is a new file)",
"Convention upgrade eligibility is evaluated for review comment 30002 (index suggestion) — the review classification output (review-30002.md) or the report's Style/Conventions analysis explains whether the suggestion matches a documented or demonstrated project convention",
"Review comment 30002 (index suggestion) results in a sub-task regardless of classification path — whether classified directly as code change request based on reviewer language, or upgraded from suggestion via convention analysis",
-"Eval Quality is N/A because no eval result reviews exist in the PR — the 3-criteria detection (author github-actions[bot], marker ## Eval Results, footer sdlc-workflow/run-evals) found no matches, so Eval Quality does not affect the Test Quality combination"
+"Eval Quality is N/A because no eval result reviews exist in the PR — the 3-criteria detection (author github-actions[bot], marker ## Eval Results, footer sdlc-workflow/run-evals) found no matches, so Eval Quality does not affect the Test Quality combination",
+"No eval failure sub-tasks are created — the sub-tasks created are exclusively review feedback sub-tasks (labels [\"ai-generated-jira\", \"review-feedback\"]), not eval failure sub-tasks (labels [\"ai-generated-jira\", \"eval-failure\"]), because Eval Quality is N/A"
]
},
{
50 changes: 46 additions & 4 deletions plugins/sdlc-workflow/skills/verify-pr/SKILL.md
@@ -492,6 +492,41 @@ Process `create-sub-task` actions from the Correctness sub-agent. For each actio
3. **Create issue link:**
jira.create_issue_link(type="Blocks", inwardIssue=<sub-task-id>, outwardIssue=<parent-task-id>)

#### Eval failure sub-tasks

Process eval assertion failures from the Style/Conventions sub-agent's Check 5
(Eval Quality verdict and failing assertion details, extracted in Step 6a). Only
proceed if Eval Quality is WARN (i.e., at least one assertion failed). If Eval
Quality is PASS or N/A, skip this section entirely.
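
The N/A case depends on whether any eval result review exists on the PR. The eval assertions in this PR describe a 3-criteria detection (author `github-actions[bot]`, marker `## Eval Results`, footer `sdlc-workflow/run-evals`). A minimal Python sketch of that check follows. The review dictionary shape (`author`, `body` keys), the function names, and the use of plain substring matching for the marker and footer are assumptions made for illustration; the skill itself does not specify them.

```python
# Hypothetical sketch: field names and substring semantics are assumptions,
# not taken from SKILL.md.
EVAL_BOT_AUTHOR = "github-actions[bot]"
EVAL_MARKER = "## Eval Results"
EVAL_FOOTER = "sdlc-workflow/run-evals"


def is_eval_result_review(review: dict) -> bool:
    """Return True only when all three criteria match."""
    body = review.get("body") or ""
    return (
        review.get("author") == EVAL_BOT_AUTHOR
        and EVAL_MARKER in body
        and EVAL_FOOTER in body
    )


def has_eval_result_review(reviews: list) -> bool:
    """False means Eval Quality would be recorded as N/A."""
    return any(is_eval_result_review(r) for r in reviews)
```

Requiring all three criteria together keeps an ordinary human review that happens to quote `## Eval Results` from being mistaken for an eval result review.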

1. **Group by eval ID:** Group failing assertions by eval ID (e.g., all failures
from eval-3 become one sub-task). Each eval with at least one failing assertion
gets one sub-task. The sub-task summary should include the eval ID and a brief
description of the failure category, e.g., "Fix eval-3 assertion failures:
convention upgrade eligibility, sub-task creation".

2. **Idempotency check:** Check the parent task's existing sub-tasks (issue links)
for sub-tasks with labels `["ai-generated-jira", "eval-failure"]` whose summaries
reference the same eval ID. If a matching sub-task already exists, skip creation
for that eval.

3. **Create sub-task:** For each failing eval, create a Jira sub-task:

jira.create_issue with:
- **Parent:** the current task's Jira issue ID
- **Summary:** "Fix <eval-id> assertion failures: <brief description of failures>"
- **Labels:** `["ai-generated-jira", "eval-failure"]`
- **Description:** structured task description following the template defined in
[`shared/task-description-template.md`](../shared/task-description-template.md).
Include:
- **Review Context** — the eval review body excerpt with specific failing
assertions and their evidence (text + evidence fields from the eval review)
- **Target PR** — the PR URL from Step 2 (so implement-task adds commits to
the existing branch)

4. **Create issue link:**
jira.create_issue_link(type="Blocks", inwardIssue=<sub-task-id>, outwardIssue=<parent-task-id>)
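
The four steps above can be sketched in Python as a pure planning function that decides which sub-tasks to create, leaving the actual `jira.create_issue` and `jira.create_issue_link` calls to the caller. Everything named here (`FailedAssertion`, `plan_eval_failure_subtasks`, the payload keys, the shape of `existing_subtasks`, and the way the brief description is built by joining assertion texts) is hypothetical; the real description must follow `shared/task-description-template.md`, which is not reproduced here.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

EVAL_FAILURE_LABELS = ["ai-generated-jira", "eval-failure"]


@dataclass(frozen=True)
class FailedAssertion:
    eval_id: str   # e.g. "eval-3"
    text: str      # assertion text from the eval review
    evidence: str  # evidence field from the eval review


def group_by_eval(failures):
    """Step 1: one group (and later one sub-task) per eval ID."""
    grouped = defaultdict(list)
    for failure in failures:
        grouped[failure.eval_id].append(failure)
    return dict(grouped)


def already_tracked(eval_id, existing_subtasks):
    """Step 2: look for an eval-failure sub-task whose summary names this eval ID."""
    # Guard against "eval-3" matching inside "eval-30".
    pattern = re.compile(rf"(?<![\w-]){re.escape(eval_id)}(?![\w-])")
    return any(
        set(EVAL_FAILURE_LABELS) <= set(task.get("labels", []))
        and pattern.search(task.get("summary", "")) is not None
        for task in existing_subtasks
    )


def plan_eval_failure_subtasks(verdict, failures, existing_subtasks, parent_id, pr_url):
    """Steps 1-3: return the sub-task payloads that still need to be created."""
    if verdict != "WARN":  # PASS or N/A: skip the section entirely
        return []
    plans = []
    for eval_id, assertions in group_by_eval(failures).items():
        if already_tracked(eval_id, existing_subtasks):
            continue
        brief = ", ".join(a.text for a in assertions)  # assumed summary style
        context = "\n".join(f"- {a.text} (evidence: {a.evidence})" for a in assertions)
        plans.append({
            "parent": parent_id,
            "summary": f"Fix {eval_id} assertion failures: {brief}",
            "labels": list(EVAL_FAILURE_LABELS),
            "description": f"Review Context:\n{context}\n\nTarget PR: {pr_url}",
        })
    return plans
```

Each returned payload would then be passed to `jira.create_issue`, followed by the `Blocks` link from step 4. Keeping the planning step free of side effects makes the idempotency rule easy to test without a Jira instance.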

### Step 6e – Reply to Review Comments

Reply to **every** classified review comment thread with the classification label and
@@ -561,17 +596,24 @@ Record the Review Feedback check result:
## Step 7 – Root-Cause Investigation

Investigate the root cause of each defect — whether flagged by a reviewer (Step 6d
-review feedback sub-tasks) or by a CI failure (Step 6d CI failure sub-tasks) — to
-identify systemic improvements that prevent similar mistakes in future tasks.
+review feedback sub-tasks), by a CI failure (Step 6d CI failure sub-tasks), or by
+an eval assertion failure (Step 6d eval failure sub-tasks) — to identify systemic
+improvements that prevent similar mistakes in future tasks.

### Step 7a – Sub-agent Investigation

If Step 6d created any sub-tasks, spawn a sub-agent to investigate the root
cause of each defect. If no sub-tasks were created, record Root-Cause
Investigation as N/A and skip to Step 8.

-The sub-agent investigates both reviewer-flagged defects and CI failures
-using the same classification and investigation process below.
+The sub-agent investigates reviewer-flagged defects, CI failures, and eval
+assertion failures using the same classification and investigation process below.

+Eval failure sub-tasks (from Step 6d) are included alongside review feedback
+sub-tasks and CI failure sub-tasks as inputs to root-cause investigation. The
+existing classification framework applies: eval assertion failures are typically
+universal knowledge (eval design patterns apply across repos) and classify as
+method-based skill gaps in the implement-task phase.

The sub-agent receives these inputs:
1. **Parent Feature description** — fetched by following the "incorporates" issue link