perf(scheduler): non-blocking script task execution#1114
Conversation
There was a problem hiding this comment.
Pull request overview
This PR’s stated goal is to make scheduled script tasks non-blocking by switching the scheduler from execFileSync to async execFile, while also enriching TaskResult with optional failure metadata. In addition, the PR bundles a large set of repo-health automation scripts/workflows, a new squad skill CLI command (APM integration), platform URL parsing fixes, docs updates, and version/changeset churn consistent with a release bump.
Changes:
- Scheduler: switch script execution to
promisify(execFile)(non-blocking), increase buffers, and surface richer failure details (stderr,code,signal,timedOut). - Automation: add multiple GitHub workflows and supporting scripts for repo health, PR readiness, PR nudges, and impact analysis.
- CLI/docs/tests: add
squad skillcommand (APM), update platform remote parsing to allow dots in repo names, and add/adjust test coverage + documentation.
Show a summary per file
| File | Description |
|---|---|
| test/template-sync.test.ts | Re-sync templates before byte-for-byte parity checks; run sync script with --sync. |
| test/scripts/security-review.test.ts | Updates findings list expectations (removes skill-scan categories). |
| test/scripts/security-review-skills.test.ts | Removes the skill-scanner unit test suite. |
| test/scripts/risk-scorer.test.ts | Adds unit tests for new PR risk scoring utility. |
| test/scripts/parse-diff.test.ts | Adds unit tests for diff name parsing + API file status enrichment. |
| test/scheduler.test.ts | Adds regression tests for new scheduler rich-error fields + event-loop non-blocking behavior. |
| test/platform-adapter.test.ts | Adds regression tests for parsing repo names containing dots (GitHub + ADO). |
| test/migrate-directory.test.cjs | Updates .NET project-type workflow expectations to be created by upgrade not init. |
| test/cross-package-exports.test.ts | Adds runtime smoke tests ensuring CLI-imported SDK exports exist at runtime. |
| test/comms-teams-integration.test.ts | Updates expected warning message for token refresh failure fallback. |
| test/cli/upgrade.test.ts | Uses OS temp dir for test root; removes warnIfSkillCustomized test block. |
| test/cli/loop.test.ts | Updates error expectation wording from “gh CLI” to “Copilot CLI”. |
| test/cli/init.test.ts | Uses OS temp dir for init tests instead of process.cwd(). |
| scripts/security-review.mjs | Introduces a new JSON-reporting security review script for PR diffs. |
| scripts/repo-health-comment.mjs | Adds shared upsert/delete PR comment helper for repo-health checks. |
| scripts/pr-readiness.mjs | Adds PR readiness comment generator + GitHub API orchestration script. |
| scripts/impact-utils/risk-scorer.mjs | Adds risk tier computation utility. |
| scripts/impact-utils/report-generator.mjs | Adds markdown impact report generator. |
| scripts/impact-utils/parse-diff.mjs | Adds diff parsing helpers (gh pr diff --name-only + API status enrichment). |
| scripts/check-squad-leakage.mjs | Adds script to detect .squad/ file leakage in PRs. |
| scripts/check-bootstrap-deps.mjs | Adds gate to ensure protected bootstrap files only import node:* (or relative) deps. |
| scripts/architectural-review.mjs | Adds structural review script (bootstrap area, exports surface, refactors, deletions, etc.). |
| scripts/analyze-impact.mjs | Adds impact analysis CLI script (uses gh to collect PR file data). |
| packages/squad-sdk/src/runtime/scheduler.ts | Non-blocking script execution + richer TaskResult error fields + buffer/timeout constants. |
| packages/squad-sdk/src/platform/detect.ts | Fixes remote parsing regexes to allow . in repo names. |
| packages/squad-sdk/package.json | Bumps SDK version to 0.9.4. |
| packages/squad-cli/src/cli/commands/watch/agent-spawn.ts | Adds shared agent spawn helpers and copilot CLI detection (Windows shell mode). |
| packages/squad-cli/src/cli/commands/skill.ts | Adds squad skill publish/install/list APM integration. |
| packages/squad-cli/src/cli-entry.ts | Wires new skill subcommand into CLI entrypoint. |
| packages/squad-cli/package.json | Bumps CLI version and pins SDK dependency version. |
| package.json | Bumps workspace root version to 0.9.4. |
| package-lock.json | Updates lockfile (but currently shows insider versions for workspace packages). |
| index.cjs | Renames variable hasForce → shouldForce. |
| docs/src/navigation.ts | Updates SDK API reference slug to reference/api-reference. |
| docs/src/content/docs/reference/sdk.md | Updates API reference link to new location. |
| docs/src/content/docs/features/state-backends.md | Adds new “State Backends” documentation page. |
| CHANGELOG.md | Adds a 0.9.4 section header/date. |
| .squad/templates/agents/challenger.md | Adds a new “Challenger” agent template under .squad/templates. |
| .squad/skills/fact-checking/SKILL.md | Adds a new fact-checking skill definition. |
| .squad-templates/workflow-wiring-guide.md | Adds workflow wiring guide template. |
| .squad-templates/workflow-wiring-appendix-b-documenter.md | Adds documenter wiring appendix template. |
| .squad-templates/workflow-wiring-appendix-a-code-reviewer.md | Adds code reviewer wiring appendix template. |
| .squad-templates/squad.agent.md | Updates coordinator template instructions (removes CURRENT_DATETIME usage, adds wiring guidance). |
| .github/workflows/squad-scope-check.yml | Adds workflow to enforce repo-health PR scope boundary. |
| .github/workflows/squad-repo-health.yml | Adds repo-health workflow running bootstrap/deletion/leakage/security/arch checks. |
| .github/workflows/squad-pr-readiness.yml | Adds PR readiness comment workflow (pull_request_target + workflow_run). |
| .github/workflows/squad-pr-nudge.yml | Adds scheduled workflow to comment on stale PRs with diagnosis/actions. |
| .github/workflows/squad-npm-publish.yml | Tightens integrity check condition to resolved.startsWith('https://'). |
| .github/workflows/squad-impact.yml | Adds automated impact report comment workflow. |
| .github/workflows/squad-docs-links.yml | Adds manual weekly docs link check workflow. |
| .github/workflows/squad-ci.yml | Refactors CI jobs/gates (removes “code” path output, reworks policy gates into separate jobs). |
| .github/copilot-instructions.md | Documents the new PR nudge automation. |
| .changeset/* | Removes many prior changesets; adds new changesets for watch fixes, deprecations, and APM integration. |
Copilot's findings
- Files reviewed: 70/71 changed files
- Comments generated: 8
PR Checkup — feedback triagedStatus: 5 of 8 threads resolved (out-of-scope
Root cause: PR was branched from stale Recommended next step: focused repo-health PR + rebase + author fix for the Windows shell injection vectors + re-request review. @copilot-pull-request-reviewer — thanks for the careful review (especially on the security findings). |
99f2c57 to
969f4b0
Compare
LocalPollingProvider.execute()'s 'script' branch used execFileSync which
fully blocks the Node event loop. With the default 60s timeout, a
runaway script could freeze the entire process — no timers, no I/O,
no OpenTelemetry exporters — for up to a minute.
Changes
-------
- Switch from execFileSync to promisify(execFile) for script tasks.
Preserves the structural injection-safety property (shell: false +
validateTaskRef rejects \0 / \r / \n).
- TaskResult extended with optional fields surfaced on rejection:
stderr — captured stderr buffer (trimmed)
code — numeric exit code
signal — terminating signal name
timedOut — true iff the configured 60s timeout fired
All optional / backward-compatible — existing callers reading only
success/output/error continue to work unchanged.
- Bump maxBuffer from Node's ~1MB default to 8MB so commands like
'git log' / 'npm ls' that legitimately produce a few MB of stdout do
not falsely reject with ERR_CHILD_PROCESS_STDIO_MAXBUFFER.
- Constants SCRIPT_DEFAULT_TIMEOUT_MS and SCRIPT_DEFAULT_MAX_BUFFER
hoisted so the next change (e.g. per-entry timeout configurability)
has a single point to override.
Tests
-----
- All 63 existing scheduler tests pass unchanged (including the four
injection-guard tests that prove 'shboyer', backticks, and
newline-in-ref are still rejected).
- Four new cases:
'captures the exit code on a non-zero exit' (code === 7)
'captures stderr written by the child' (stderr captured)
'does not set timedOut for ordinary non-zero exits' (timedOut undefined)
'does not block the event loop while a script runs' (3 timers fire
during a 150ms script sleep — would all be delayed past 150ms
with the previous execFileSync impl)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
969f4b0 to
39633fe
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
✅ All previously-flagged review feedback has been addressed:
@copilot-pull-request-reviewer — ready for another look when you get a chance. Thanks! |
Summary
LocalPollingProvider.execute()'s'script'branch usedexecFileSync— fully synchronous, blocks the entire Node event loop. With the existing 60 s timeout, a runaway script could freeze the process — no timers, no I/O, no OpenTelemetry exporters — for up to a minute per scheduled task. The Ralph watch loop and any concurrent telemetry get blocked in lock-step.Switching to the promisified
execFilekeeps the event loop turning while scripts run, preserves the existing injection-safety guarantees (shell: false +validateTaskRef), and unlocks richer error reporting via optionalTaskResultfields.What Changed
packages/squad-sdk/src/runtime/scheduler.ts:execFileSync→promisify(execFile). The dynamicawait import('node:child_process')is replaced with a static top-level import — the dynamic import was a holdover from an older code path and added a per-task module-resolution lookup.TaskResultinterface extended with four optional fields surfaced on rejection:stderr— captured stderr buffer (trimmed)code— numeric exit code when knownsignal— terminating signal name when knowntimedOut— true iff the configured timeout firedAll optional and backward-compatible: existing callers reading only
success/output/errorcontinue to work unchanged.maxBufferraised from Node's ~1 MB default to 8 MB, so commands likegit log/npm lsthat legitimately produce a few MB of stdout do not falsely reject withERR_CHILD_PROCESS_STDIO_MAXBUFFER.SCRIPT_DEFAULT_TIMEOUT_MS,SCRIPT_DEFAULT_MAX_BUFFER) so a future per-entry-timeout-configurability change has a single point to override.Safety
The structural injection-safety property is preserved:
execFilewithshell: false(the default when args are passed as an array) — no shell interpretation of$(...), backticks,|,;, etc.validateTaskRef()continues to reject refs containing\0,\r, or\n.The four existing injection-guard tests (which exercise
$(whoami), backticks, and newline-in-ref payloads) all pass unchanged.Testing
test/scheduler.test.ts:process.exit(7)→result.code === 7)process.stderr.write('boom-from-stderr')→result.stderrcontains it)result.timedOut === undefined)setTimeoutcallbacks at 10/30/50 ms and runs a 150 ms script. WithexecFileSyncall three would be delayed past 150 ms; with the async version they fire close to schedule. The test assertsticks.length === 3andeach tick < 75 ms, proving the event loop kept turning.Notes
Part of a five-PR perf series. Companion PRs: