1 change: 1 addition & 0 deletions .gitignore
@@ -16,6 +16,7 @@ test-temp-*
# skills-package-manager
.agents/skills
.gemini/skills
.claude/skills

# skills-test
skills-test/**/*
8 changes: 8 additions & 0 deletions README.md
@@ -62,6 +62,14 @@ Helps Rspack users and developers debug crashes or deadlocks/hangs in the Rspack

Use this Skill when users encounter "Segmentation fault" errors during Rspack builds or when the build progress gets stuck.

### rstack-eco-ci-debug

```bash
npx skills add rstackjs/agent-skills --skill rstack-eco-ci-debug
```

Debug Rstack ecosystem CI failures and attribute the real source PR behind Rspack eco-ci red suites.

### rspack-tracing

```bash
44 changes: 44 additions & 0 deletions skills-test/rstack-eco-ci-debug/evals/evals.json
@@ -0,0 +1,44 @@
{
"skill_name": "rstack-eco-ci-debug",
"evals": [
{
"id": 1,
"eval_name": "plugin-suite-empty-lines",
"prompt": "The plugin suite is failing in rstack-ecosystem-ci and the status data shows it turned red around Rspack PR #14254. Can you help me figure out whether that PR is really the cause? My local Rspack checkout is at /Users/bytedance/Documents/codes/rspack.",


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace hardcoded local checkout paths with a neutral placeholder.

Using /Users/bytedance/Documents/codes/rspack in committed eval prompts leaks local environment details and makes fixtures less portable. Prefer a placeholder (for example, <RSPACK_CHECKOUT_PATH>) and keep machine-specific values in local runtime inputs.

Suggested change
- "prompt": "... My local Rspack checkout is at /Users/bytedance/Documents/codes/rspack.",
+ "prompt": "... My local Rspack checkout is at <RSPACK_CHECKOUT_PATH>.",

Also applies to: 20-20, 33-33

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills-test/rstack-eco-ci-debug/evals/evals.json` at line 7, Replace the
hardcoded local machine path /Users/bytedance/Documents/codes/rspack with a
neutral placeholder such as <RSPACK_CHECKOUT_PATH> in the evals.json file. This
change should be applied to all occurrences of the hardcoded path (at lines 7,
20, and 33) to remove machine-specific environment details and make the test
fixtures portable across different environments without requiring local
environment-specific values.
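
One way the placeholder could be filled in at run time is sketched below. This is only an illustration: the `resolve_prompt` helper and the `RSPACK_CHECKOUT_PATH` environment variable are assumptions, not part of the eval harness in this PR. Only the `<RSPACK_CHECKOUT_PATH>` placeholder name comes from the suggestion above.

```python
import os
from typing import Optional

# Placeholder name taken from the review suggestion above.
PLACEHOLDER = "<RSPACK_CHECKOUT_PATH>"


def resolve_prompt(prompt: str, checkout_path: Optional[str] = None) -> str:
    """Substitute the placeholder with a machine-specific path at run time.

    Hypothetical helper: the environment variable name is an assumption,
    not something the eval harness is known to read.
    """
    path = checkout_path or os.environ.get("RSPACK_CHECKOUT_PATH")
    if PLACEHOLDER not in prompt:
        # Nothing to substitute, so return the prompt unchanged.
        return prompt
    if not path:
        raise ValueError("prompt needs a checkout path but none was provided")
    return prompt.replace(PLACEHOLDER, path)
```

With this shape, the committed fixture stays portable and each machine supplies its own path when the eval is launched.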

"expected_output": "The skill identifies PR #14254 as the actual source of the plugin suite failure, explains that the failure signature involves extra empty lines being added to the error stack, and provides supporting evidence such as run URLs, log snippets, or commit inspection.",
"files": [],
"expectations": [


P2: Rename expectations to assertions for grading

The new eval criteria are stored under expectations, but the repo's eval definitions consistently use assertions as the grader rubric key (checked the other skills-test/*/evals/evals.json files). With this key, the automated evaluator will not see these four checks, so the reported 12/12 pass rate cannot be reproduced from the committed eval file; rename this field in all three new eval cases to assertions.
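
A quick check for this could look like the sketch below. It assumes, as the comment above states, that the grader reads its rubric from an `assertions` key. The `cases_missing_assertions` helper is hypothetical and only inspects the parsed JSON, it does not call any real evaluator API.

```python
import json


def cases_missing_assertions(evals_doc):
    """Return the eval_name of every case that has no 'assertions' key.

    Hypothetical lint helper; the expected key name is taken from the
    review comment, not verified against the evaluator itself.
    """
    return [
        case.get("eval_name")
        for case in evals_doc.get("evals", [])
        if "assertions" not in case
    ]


# Minimal stand-in for evals.json, using two eval names from this PR.
sample = json.loads("""
{
  "skill_name": "rstack-eco-ci-debug",
  "evals": [
    {"id": 1, "eval_name": "plugin-suite-empty-lines", "expectations": ["..."]},
    {"id": 2, "eval_name": "rstest-suite-misattribution", "assertions": ["..."]}
  ]
}
""")

print(cases_missing_assertions(sample))
```

Run against the committed file, a non-empty result would indicate cases the grader may silently skip.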


"The output names the plugin suite as the failing suite",
"The output identifies PR #14254 as the actual source, not just the surface pivot",
"The output explains the failure signature (extra empty lines in the error stack)",
"The output includes at least one piece of evidence (run URL, log snippet, or commit reference)"
]
},
{
"id": 2,
"eval_name": "rstest-suite-misattribution",
"prompt": "The rstest suite started failing in rstack-ecosystem-ci and the green-to-red pivot points to Rspack PR #14353. I'm skeptical that #14353 is the real cause. Can you investigate and tell me what actually broke it? My local Rspack checkout is at /Users/bytedance/Documents/codes/rspack.",
"expected_output": "The skill does not blame PR #14353. Instead it identifies rstest PR #1357 as the actual source, explaining that #1357 upgraded to Rspack 2.0.8 (released between PR #14283 and PR #14350) and updated snapshots, which caused the rstest suite failure when the newer Rspack artifact was tested. It distinguishes surface attribution from actual source.",
"files": [],
"expectations": [
"The output does not attribute the failure to PR #14353 as the actual source",
"The output identifies rstest PR #1357 as the actual source or points to it",
"The output explains the snapshot update and Rspack 2.0.8 timing",
"The output distinguishes surface attribution from actual source"
]
},
{
"id": 3,
"eval_name": "rsdoctor-swc-semantic-bug",
"prompt": "In this rstack-ecosystem-ci run the rsdoctor suite failed: https://github.com/rstackjs/rstack-ecosystem-ci/actions/runs/27249648948. Can you find the real source PR and explain the mechanism? My local Rspack checkout is at /Users/bytedance/Documents/codes/rspack.",
"expected_output": "The skill identifies PR #14256 (the swc refactoring) as the actual source of the rsdoctor suite failure. It explains the mechanism: swc exp produced semantic results that differed from swc core, causing the concatenated module not to treat the top-level lightColor variable and the for-loop-init lightColor variable as the same variable during renaming.",
"files": [],
"expectations": [
"The output names the rsdoctor suite as the failing suite",
"The output identifies PR #14256 as the actual source",
"The output explains the swc semantic inconsistency mechanism",
"The output mentions the concatenated module variable renaming issue with lightColor"
]
}
]
}
101 changes: 101 additions & 0 deletions skills-test/rstack-eco-ci-debug/report.md
@@ -0,0 +1,101 @@
# rstack-eco-ci-debug Eval Report

**Skill:** `rstack-eco-ci-debug`
**Skill commit:** `ad2cc4b` (`syt/codex-rstack-eco-ci-debug` branch)
**Date:** 2026-06-18
**Model:** Claude (Opus 4.8)
**Workspace:** `skills-test/rstack-eco-ci-debug/workspace/iteration-1`

---

## Summary

One round of evaluation was run against 3 real rstack-ecosystem-ci failures. Each eval ran once with the skill and once without the skill.

| Metric | With Skill | Without Skill | Delta |
| -------- | ------------ | --------------- | ------- |
| Pass rate | **100%** (12/12) | **75%** (9/12) | **+25 pp** |
| Avg. wall time | 956.7 s | 2,135.5 s | −1,178.8 s |
| Avg. tokens | 92,102 | 46,862 | +45,240 |

The skill produced materially better attribution on the hardest case (rsdoctor SWC semantic bug) and converged faster on the plugin-suite empty-line case. Token usage is higher with the skill because it performs a structured two-phase investigation (Phase 1 PR location, Phase 2 deep root cause).
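
The summary figures can be rechecked from the per-eval tables in this report. The sketch below is not the harness's own aggregation (that lives in `benchmark.json`, which is not shown here). It simply recomputes the averages from the numbers copied out of the three eval tables, treating the Eval 3 with-skill time and token values as the rough estimates the report says they are.

```python
from statistics import mean

# (passed, total, wall_time_s, tokens) per eval, copied from the per-eval tables.
# Eval 3 with_skill time and tokens are the report's own rough estimates.
RUNS = {
    "with_skill": [(4, 4, 856.1, 91265), (4, 4, 1014.0, 105041), (4, 4, 1000.0, 80000)],
    "without_skill": [(4, 4, 5374.1, 42605), (4, 4, 334.8, 29430), (1, 4, 697.7, 68552)],
}


def summarize(runs):
    """Aggregate pass rate, average wall time, and average tokens for one configuration."""
    passed = sum(r[0] for r in runs)
    total = sum(r[1] for r in runs)
    return {
        "pass_rate": passed / total,
        "avg_time_s": round(mean(r[2] for r in runs), 1),
        "avg_tokens": round(mean(r[3] for r in runs)),
    }


for name, runs in RUNS.items():
    print(name, summarize(runs))
```

Because the with-skill averages include estimated values for Eval 3, the time and token deltas are approximate even though the arithmetic matches.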

---

## Eval Cases

### Eval 1 — plugin-suite-empty-lines

**Question:** Why did the `plugin` suite turn red, and is Rspack PR #14254 the real source?
**Surface pivot:** PR #14254 (`feat(runtime): introduce experimental.runtimeMode`).
**Actual source:** PR #14254 — but the failure mechanism is *incidental* trailing newlines in 5 EJS templates, not the runtimeMode feature itself.

| Configuration | Pass Rate | Time | Tokens |
| ------------- | --------- | ---- | ------ |
| with_skill | 4/4 (100%) | 856.1 s | 91,265 |
| without_skill | 4/4 (100%) | 5,374.1 s | 42,605 |

Both configurations correctly identified PR #14254 and the extra-blank-line signature. The with-skill run reached the same conclusion in ~16% of the wall time by following the structured eco-ci workflow.

---

### Eval 2 — rstest-suite-misattribution

**Question:** The eco-ci bisect points at Rspack PR #14353; is it actually the source of the `rstest` suite failure?
**Surface pivot:** PR #14353.
**Actual source:** rstest PR #1357 (downstream snapshot/timeout expectation change).

| Configuration | Pass Rate | Time | Tokens |
| ------------- | --------- | ---- | ------ |
| with_skill | 4/4 (100%) | 1,014.0 s | 105,041 |
| without_skill | 4/4 (100%) | 334.8 s | 29,430 |

Both configurations correctly exonerated PR #14353 and pointed to rstest PR #1357. The without-skill run was faster here because it gave a brief, shallow answer that happened to be correct; the skill ran its full two-phase workflow anyway. No quality regression, but a token/time trade-off.

---

### Eval 3 — rsdoctor-swc-semantic-bug

**Question:** The `rsdoctor` suite failed with `ReferenceError: lightColorCount is not defined`; what is the real source PR and root cause?
**Surface attribution:** release branch at commit `ac3fa6a2d0`.
**Actual source:** PR #14256 (`refactor: swc exp for javascript parser plugin`), interacting with PR #14335's scope-info rewrite.

| Configuration | Pass Rate | Time | Tokens |
| ------------- | --------- | ---- | ------ |
| with_skill | 4/4 (100%) | ~1,000 s* | ~80,000* |
| without_skill | 1/4 (25%) | 697.7 s | 68,552 |

This is the discriminating case. Without the skill, the run latched onto a different nearby PR (#14335) and missed the SWC exp/core semantic inconsistency and the `lightColorCount` variable-renaming failure signature. With the skill, the run used the canary-date bisect and revert-commit evidence to identify PR #14256 as the actual source and explained the concatenated-module scope bug.

\* Eval 3 with_skill ran asynchronously; timing was not instrumented by the harness, so values are rough estimates based on comparable runs.

---

## Where the Skill Helped

1. **Surface vs. actual source distinction** — The skill explicitly separates "what the eco-ci dashboard says" from "which PR actually introduced the regression," which prevented the wrong-PR attribution on Eval 3.
2. **Failure-signature anchoring** — It requires tying conclusions to concrete signatures (`lightColorCount is not defined`, extra blank lines in snapshots), not just commit positions.
3. **Structured evidence gathering** — Use of revert commits, green-to-red pivots, and canary-date bisect kept the investigation on track.
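
To make the "surface pivot" idea concrete, here is a minimal sketch of locating a green-to-red transition in an ordered run history. The `find_pivot` function and the run labels are hypothetical and are not taken from the skill or from the eco-ci status data. The result is only the surface attribution, which the evals above show can differ from the actual source PR.

```python
def find_pivot(runs):
    """Find the most recent green-to-red transition.

    runs: list of (label, is_green) tuples ordered oldest to newest.
    Returns (last_green_label, first_red_label), or None if there is no
    such transition. Hypothetical helper for illustration only.
    """
    for i in range(len(runs) - 1, 0, -1):
        prev_label, prev_green = runs[i - 1]
        label, green = runs[i]
        if prev_green and not green:
            return prev_label, label
    return None


# Made-up history: two green runs followed by two red runs.
history = [("run-a", True), ("run-b", True), ("run-c", False), ("run-d", False)]
print(find_pivot(history))
```

The pivot narrows the search window, and the failure signature and revert evidence then decide which change inside or outside that window is the real source.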

## Costs / Trade-offs

- **Higher token usage** with the skill (≈2×) because of the explicit Phase 1 → Phase 2 workflow.
- **Not always faster** when the answer is shallow (Eval 2).
- **Relies on eval cases with clear public artifacts** (run URLs, revert commits, data JSON). Cases without these will regress toward the baseline.

---

## Artifacts

- Eval definitions: `skills-test/rstack-eco-ci-debug/evals/evals.json`
- Raw outputs + grading: `skills-test/rstack-eco-ci-debug/workspace/iteration-1/`
- Quantitative benchmark: `workspace/iteration-1/benchmark.json` and `workspace/iteration-1/benchmark.md`
- Static review viewer: `workspace/iteration-1/review.html`

---

## Next Steps (Suggested)

1. Add a few more discriminating cases where the surface pivot is *not* the real source, to confirm the skill's value isn't driven by a single eval.
2. Consider a shorter "fast path" in the skill for cases where the surface pivot is clearly correct, to reduce token/time overhead on easy attributions.
3. Commit `report.md` and `evals/evals.json`; raw workspace outputs are gitignored.
2 changes: 1 addition & 1 deletion skills.json
@@ -1,7 +1,7 @@
{
"$schema": "https://unpkg.com/skills-package-manager@0.11.0/skills.schema.json",
"installDir": ".agents/skills",
- "linkTargets": [],
+ "linkTargets": [".claude/skills"],
"skills": {
"skill-creator": "github:anthropics/skills#57546260929473d4e0d1c1bb75297be2fdfa1949&path:/skills/skill-creator",
"rstack-skill-evaluator": "link:./dev-skills/rstack-skill-evaluator",