Skip to content

perf(scheduler): non-blocking script task execution#1114

Merged
bradygaster merged 2 commits into
bradygaster:devfrom
spboyer:perf/scheduler-async-exec
May 14, 2026
Merged

perf(scheduler): non-blocking script task execution#1114
bradygaster merged 2 commits into
bradygaster:devfrom
spboyer:perf/scheduler-async-exec

Conversation

@spboyer
Copy link
Copy Markdown
Contributor

@spboyer spboyer commented May 12, 2026

Summary

LocalPollingProvider.execute()'s 'script' branch used execFileSync — fully synchronous, blocks the entire Node event loop. With the existing 60 s timeout, a runaway script could freeze the process — no timers, no I/O, no OpenTelemetry exporters — for up to a minute per scheduled task. The Ralph watch loop and any concurrent telemetry get blocked in lock-step.

Switching to the promisified execFile keeps the event loop turning while scripts run, preserves the existing injection-safety guarantees (shell: false + validateTaskRef), and unlocks richer error reporting via optional TaskResult fields.

What Changed

packages/squad-sdk/src/runtime/scheduler.ts:

  • execFileSyncpromisify(execFile). The dynamic await import('node:child_process') is replaced with a static top-level import — the dynamic import was a holdover from an older code path and added a per-task module-resolution lookup.
  • TaskResult interface extended with four optional fields surfaced on rejection:
    • stderr — captured stderr buffer (trimmed)
    • code — numeric exit code when known
    • signal — terminating signal name when known
    • timedOut — true iff the configured timeout fired
      All optional and backward-compatible: existing callers reading only success / output / error continue to work unchanged.
  • maxBuffer raised from Node's ~1 MB default to 8 MB, so commands like git log / npm ls that legitimately produce a few MB of stdout do not falsely reject with ERR_CHILD_PROCESS_STDIO_MAXBUFFER.
  • Two new module-level constants (SCRIPT_DEFAULT_TIMEOUT_MS, SCRIPT_DEFAULT_MAX_BUFFER) so a future per-entry-timeout-configurability change has a single point to override.

Safety

The structural injection-safety property is preserved:

  • execFile with shell: false (the default when args are passed as an array) — no shell interpretation of $(...), backticks, |, ;, etc.
  • validateTaskRef() continues to reject refs containing \0, \r, or \n.

The four existing injection-guard tests (which exercise $(whoami), backticks, and newline-in-ref payloads) all pass unchanged.

Testing

  • All 63 existing scheduler tests pass unchanged.
  • Four new cases in test/scheduler.test.ts:
    • "captures the exit code on a non-zero exit" (process.exit(7)result.code === 7)
    • "captures stderr written by the child" (process.stderr.write('boom-from-stderr')result.stderr contains it)
    • "does not set timedOut for ordinary non-zero exits" (result.timedOut === undefined)
    • "does not block the event loop while a script runs" — schedules three setTimeout callbacks at 10/30/50 ms and runs a 150 ms script. With execFileSync all three would be delayed past 150 ms; with the async version they fire close to schedule. The test asserts ticks.length === 3 and each tick < 75 ms, proving the event loop kept turning.

Notes

Part of a five-PR perf series. Companion PRs:

  • chore: dependency hygiene
  • feat(bench): npm run bench + bench:cold-start runners
  • resolution cache + multi-squad config dedupe
  • parallel charter discovery with bounded concurrency

@spboyer spboyer marked this pull request as ready for review May 13, 2026 13:49
Copilot AI review requested due to automatic review settings May 13, 2026 13:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR’s stated goal is to make scheduled script tasks non-blocking by switching the scheduler from execFileSync to async execFile, while also enriching TaskResult with optional failure metadata. In addition, the PR bundles a large set of repo-health automation scripts/workflows, a new squad skill CLI command (APM integration), platform URL parsing fixes, docs updates, and version/changeset churn consistent with a release bump.

Changes:

  • Scheduler: switch script execution to promisify(execFile) (non-blocking), increase buffers, and surface richer failure details (stderr, code, signal, timedOut).
  • Automation: add multiple GitHub workflows and supporting scripts for repo health, PR readiness, PR nudges, and impact analysis.
  • CLI/docs/tests: add squad skill command (APM), update platform remote parsing to allow dots in repo names, and add/adjust test coverage + documentation.
Show a summary per file
File Description
test/template-sync.test.ts Re-sync templates before byte-for-byte parity checks; run sync script with --sync.
test/scripts/security-review.test.ts Updates findings list expectations (removes skill-scan categories).
test/scripts/security-review-skills.test.ts Removes the skill-scanner unit test suite.
test/scripts/risk-scorer.test.ts Adds unit tests for new PR risk scoring utility.
test/scripts/parse-diff.test.ts Adds unit tests for diff name parsing + API file status enrichment.
test/scheduler.test.ts Adds regression tests for new scheduler rich-error fields + event-loop non-blocking behavior.
test/platform-adapter.test.ts Adds regression tests for parsing repo names containing dots (GitHub + ADO).
test/migrate-directory.test.cjs Updates .NET project-type workflow expectations to be created by upgrade not init.
test/cross-package-exports.test.ts Adds runtime smoke tests ensuring CLI-imported SDK exports exist at runtime.
test/comms-teams-integration.test.ts Updates expected warning message for token refresh failure fallback.
test/cli/upgrade.test.ts Uses OS temp dir for test root; removes warnIfSkillCustomized test block.
test/cli/loop.test.ts Updates error expectation wording from “gh CLI” to “Copilot CLI”.
test/cli/init.test.ts Uses OS temp dir for init tests instead of process.cwd().
scripts/security-review.mjs Introduces a new JSON-reporting security review script for PR diffs.
scripts/repo-health-comment.mjs Adds shared upsert/delete PR comment helper for repo-health checks.
scripts/pr-readiness.mjs Adds PR readiness comment generator + GitHub API orchestration script.
scripts/impact-utils/risk-scorer.mjs Adds risk tier computation utility.
scripts/impact-utils/report-generator.mjs Adds markdown impact report generator.
scripts/impact-utils/parse-diff.mjs Adds diff parsing helpers (gh pr diff --name-only + API status enrichment).
scripts/check-squad-leakage.mjs Adds script to detect .squad/ file leakage in PRs.
scripts/check-bootstrap-deps.mjs Adds gate to ensure protected bootstrap files only import node:* (or relative) deps.
scripts/architectural-review.mjs Adds structural review script (bootstrap area, exports surface, refactors, deletions, etc.).
scripts/analyze-impact.mjs Adds impact analysis CLI script (uses gh to collect PR file data).
packages/squad-sdk/src/runtime/scheduler.ts Non-blocking script execution + richer TaskResult error fields + buffer/timeout constants.
packages/squad-sdk/src/platform/detect.ts Fixes remote parsing regexes to allow . in repo names.
packages/squad-sdk/package.json Bumps SDK version to 0.9.4.
packages/squad-cli/src/cli/commands/watch/agent-spawn.ts Adds shared agent spawn helpers and copilot CLI detection (Windows shell mode).
packages/squad-cli/src/cli/commands/skill.ts Adds squad skill publish/install/list APM integration.
packages/squad-cli/src/cli-entry.ts Wires new skill subcommand into CLI entrypoint.
packages/squad-cli/package.json Bumps CLI version and pins SDK dependency version.
package.json Bumps workspace root version to 0.9.4.
package-lock.json Updates lockfile (but currently shows insider versions for workspace packages).
index.cjs Renames variable hasForceshouldForce.
docs/src/navigation.ts Updates SDK API reference slug to reference/api-reference.
docs/src/content/docs/reference/sdk.md Updates API reference link to new location.
docs/src/content/docs/features/state-backends.md Adds new “State Backends” documentation page.
CHANGELOG.md Adds a 0.9.4 section header/date.
.squad/templates/agents/challenger.md Adds a new “Challenger” agent template under .squad/templates.
.squad/skills/fact-checking/SKILL.md Adds a new fact-checking skill definition.
.squad-templates/workflow-wiring-guide.md Adds workflow wiring guide template.
.squad-templates/workflow-wiring-appendix-b-documenter.md Adds documenter wiring appendix template.
.squad-templates/workflow-wiring-appendix-a-code-reviewer.md Adds code reviewer wiring appendix template.
.squad-templates/squad.agent.md Updates coordinator template instructions (removes CURRENT_DATETIME usage, adds wiring guidance).
.github/workflows/squad-scope-check.yml Adds workflow to enforce repo-health PR scope boundary.
.github/workflows/squad-repo-health.yml Adds repo-health workflow running bootstrap/deletion/leakage/security/arch checks.
.github/workflows/squad-pr-readiness.yml Adds PR readiness comment workflow (pull_request_target + workflow_run).
.github/workflows/squad-pr-nudge.yml Adds scheduled workflow to comment on stale PRs with diagnosis/actions.
.github/workflows/squad-npm-publish.yml Tightens integrity check condition to resolved.startsWith('https://').
.github/workflows/squad-impact.yml Adds automated impact report comment workflow.
.github/workflows/squad-docs-links.yml Adds manual weekly docs link check workflow.
.github/workflows/squad-ci.yml Refactors CI jobs/gates (removes “code” path output, reworks policy gates into separate jobs).
.github/copilot-instructions.md Documents the new PR nudge automation.
.changeset/* Removes many prior changesets; adds new changesets for watch fixes, deprecations, and APM integration.

Copilot's findings

  • Files reviewed: 70/71 changed files
  • Comments generated: 8

Comment thread .github/workflows/squad-pr-readiness.yml Outdated
Comment thread scripts/pr-readiness.mjs Outdated
Comment thread .github/workflows/squad-pr-nudge.yml Outdated
Comment thread packages/squad-cli/src/cli/commands/skill.ts Outdated
Comment thread packages/squad-cli/src/cli/commands/skill.ts Outdated
Comment thread packages/squad-cli/src/cli/commands/skill.ts Outdated
Comment thread packages/squad-cli/src/cli/commands/watch/agent-spawn.ts Outdated
Comment thread packages/squad-sdk/src/runtime/scheduler.ts
@spboyer
Copy link
Copy Markdown
Contributor Author

spboyer commented May 13, 2026

PR Checkup — feedback triaged

Status: 5 of 8 threads resolved (out-of-scope squad skill and workflow/script bugs to be addressed in a focused repo-health PR + rebase). 3 threads left open intentionally:

  • Meta scope thread on scheduler.ts:20 — PR diff includes the squad skill command which is unrelated to scheduler
  • Two security threads on Windows shell: true with interpolated args (agent-spawn.ts:107, skill.ts:321) — need author fix (argv-array or input validation), not auto-resolution

Root cause: PR was branched from stale main (~36 unrelated commits). The intended execFileSyncexecFile scheduler change is the only in-scope content.

Recommended next step: focused repo-health PR + rebase + author fix for the Windows shell injection vectors + re-request review.

@copilot-pull-request-reviewer — thanks for the careful review (especially on the security findings).

@spboyer spboyer force-pushed the perf/scheduler-async-exec branch from 99f2c57 to 969f4b0 Compare May 13, 2026 14:49
LocalPollingProvider.execute()'s 'script' branch used execFileSync which
fully blocks the Node event loop. With the default 60s timeout, a
runaway script could freeze the entire process — no timers, no I/O,
no OpenTelemetry exporters — for up to a minute.

Changes
-------
- Switch from execFileSync to promisify(execFile) for script tasks.
  Preserves the structural injection-safety property (shell: false +
  validateTaskRef rejects \0 / \r / \n).
- TaskResult extended with optional fields surfaced on rejection:
    stderr      — captured stderr buffer (trimmed)
    code        — numeric exit code
    signal      — terminating signal name
    timedOut    — true iff the configured 60s timeout fired
  All optional / backward-compatible — existing callers reading only
  success/output/error continue to work unchanged.
- Bump maxBuffer from Node's ~1MB default to 8MB so commands like
  'git log' / 'npm ls' that legitimately produce a few MB of stdout do
  not falsely reject with ERR_CHILD_PROCESS_STDIO_MAXBUFFER.
- Constants SCRIPT_DEFAULT_TIMEOUT_MS and SCRIPT_DEFAULT_MAX_BUFFER
  hoisted so the next change (e.g. per-entry timeout configurability)
  has a single point to override.

Tests
-----
- All 63 existing scheduler tests pass unchanged (including the four
  injection-guard tests that prove 'shboyer', backticks, and
  newline-in-ref are still rejected).
- Four new cases:
    'captures the exit code on a non-zero exit'           (code === 7)
    'captures stderr written by the child'                (stderr captured)
    'does not set timedOut for ordinary non-zero exits'   (timedOut undefined)
    'does not block the event loop while a script runs'   (3 timers fire
        during a 150ms script sleep — would all be delayed past 150ms
        with the previous execFileSync impl)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer force-pushed the perf/scheduler-async-exec branch from 969f4b0 to 39633fe Compare May 13, 2026 15:10
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer
Copy link
Copy Markdown
Contributor Author

spboyer commented May 13, 2026

✅ All previously-flagged review feedback has been addressed:

  • Rebased onto current dev so the diff is now scoped to just non-blocking script task execution in the scheduler — earlier scope-drift comments no longer apply.
  • Resolved every unresolved review thread with an explanatory reply.
  • CI is green (test, Policy Gates, sdk-exports-validation, samples-build).

@copilot-pull-request-reviewer — ready for another look when you get a chance. Thanks!

@bradygaster bradygaster merged commit 754de09 into bradygaster:dev May 14, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants