fix: harden scheduler reliability, validation, and observability#51
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2918ad9809
src/scheduler/service.ts (Outdated)

```ts
const delay = Math.max(0, new Date(row.next).getTime() - Date.now());
this.timer = setTimeout(() => this.onTimer(), delay);
```
Reintroduce max timeout clamp before scheduling next wake-up
Using the raw delay in `setTimeout` regresses scheduling for jobs whose next fire is more than ~24.8 days away: both Node and Bun overflow delays above 2^31 − 1 ms and coerce them to ~1 ms, so the scheduler enters a hot loop (`onTimer`/DB query/re-arm) until the wall clock catches up. This can peg the CPU and spam logs whenever an `at` job or long-interval job is far in the future. The previous clamp avoided this overflow by waking periodically and recalculating.
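A minimal sketch of the clamp the review asks to reinstate, assuming a recursive re-arm pattern (the names `scheduleAt`, `clampDelay`, and `MAX_TIMER_MS` are illustrative, not the PR's actual code):

```typescript
// Illustrative clamp (not the PR's code): cap any single setTimeout at
// the 32-bit signed limit and re-arm early, so delays beyond ~24.8 days
// never overflow and collapse to ~1 ms.
const MAX_TIMER_MS = 2 ** 31 - 1; // 2147483647 ms ≈ 24.8 days

// Pure helper so the clamp is testable in isolation.
function clampDelay(remainingMs: number): number {
  return Math.min(Math.max(0, remainingMs), MAX_TIMER_MS);
}

function scheduleAt(
  targetMs: number,
  onFire: () => void,
): ReturnType<typeof setTimeout> {
  const remaining = targetMs - Date.now();
  if (remaining > MAX_TIMER_MS) {
    // Too far out for one timer: wake at the clamp and recalculate.
    return setTimeout(() => scheduleAt(targetMs, onFire), MAX_TIMER_MS);
  }
  return setTimeout(onFire, clampDelay(remaining));
}
```

Waking at the clamp and recomputing the remaining delay keeps the timer cheap (one wake-up per ~24.8 days) while never handing `setTimeout` an overflowing value.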
Summary
- `phantom_schedule` now rejects invalid cron expressions, unknown timezones, past `at` timestamps, and invalid delivery targets at creation time with descriptive errors. Rows are never inserted with `next_run_at = NULL`, and the tool surfaces `isError: true` so the agent can react.
- New `last_delivery_status` column. Every delivery attempt is recorded with a concrete outcome (`delivered`, `dropped:<reason>`, `error:<reason>`). The Slack layer catches errors internally and returns `null`, so the scheduler now checks for that explicitly: a real Slack outage surfaces as `error:slack_returned_null` instead of being stamped delivered. The existing try/catch remains as a belt-and-braces guard.
- The executor checks `runtime.isSessionBusy()` before calling `handleMessage` and skips the fire without advancing cadence or incrementing `consecutive_errors`. Direct callers of `handleMessage` outside the scheduler now see `Error: session busy` instead of a silent bounce string that used to be scored as success and delivered to Slack as the result.
- `next_run_at` rewrites inside `start()` are staggered rather than sequentially awaited. A Phantom coming back from downtime no longer blocks boot on its missed schedule backlog.
- `/health` includes a new `scheduler` summary with `total`, `active`, `paused`, `completed`, `failed`, `nextFireAt`, and `recentFailures`.
- Failure backoff is clamped to `min(backoff, next_cron_fire)` so transient errors cannot drift a job permanently off its schedule.
- A `phantom_claude` Docker volume mounts at `/home/phantom/.claude`, so `claude login` credentials survive image rebuilds and container recreates. Subscription users no longer have to re-authenticate after every deployment.
- `src/scheduler/service.ts` was split by responsibility into `executor.ts`, `delivery.ts`, `recovery.ts`, `health.ts`, `create-validation.ts`, and `row-mapper.ts`. Every scheduler file is now under 300 lines.

Additional hardening:
- `runJobNow` respects the executing guard and rejects non-active status
- corrupt rows in `listJobs`/`getJob`/`onTimer` are logged and skipped rather than bricking the list
- duplicate job names rejected at creation
- 32 KB task size limit
- `MAX_JOBS=1000` rate limit on creation
- cron expressions pinned to 5-field format with explicit nickname rejection
- `MAX_TIMER_MS` clamp removed so the timer sleeps until the actual next fire
- 30-day cleanup sweep for terminal rows
- runtime status guard refuses to write an out-of-enum `status` value
- the `zod-schema` action field gets a rich multi-line `.describe()`

Every change is additive for existing deployments. No `phantom.yaml` or `.env` changes required. Deployments that do not currently set a `provider:` block continue to work byte-for-byte unchanged.

Test plan
- `bun test`: 930 pass, 0 fail (baseline 875, +55 new tests)
- `bun run typecheck` clean
- `bun run lint` clean
- every scheduler file under 300 lines (`service.ts` at 269)
- creation validation rejects past `at`, invalid delivery targets, duplicate names, 32 KB task size, and the job count limit
- `deliverResult` covered for every outcome on every target shape, including the Slack null-return contract and the thrown-error belt path
- `executeJob` Slack outage scenarios (null return and thrown error) do not kill subsequent jobs and are correctly recorded in `last_delivery_status`
- `staggerMissedJobs` asserts `start()` returns in milliseconds even with 50 missed jobs
- `runJobNow` rejects a concurrent call and rejects non-active job status
- `listJobs` and `getJob` survive a corrupt row alongside good rows
- `/health` scheduler summary returns all seven fields
- cron nicknames rejected (`@daily`, `@yearly`, etc.)
- `min(backoff, next_cron_fire)` on the failure path for cron jobs
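The failure-path clamp exercised by the last test can be sketched like this (the constants and the `nextRunAfterFailure` name are hypothetical, not the PR's actual implementation):

```typescript
// Hypothetical sketch of the failure-path rule: exponential backoff on
// consecutive errors, but never later than the job's next regular cron
// fire, so a run of transient errors cannot drift a cron job off its
// schedule.
const BASE_BACKOFF_MS = 30_000; // illustrative base retry delay
const MAX_BACKOFF_MS = 60 * 60 * 1000; // illustrative backoff ceiling

function nextRunAfterFailure(
  nowMs: number,
  consecutiveErrors: number,
  nextCronFireMs: number,
): number {
  const backoff = Math.min(
    BASE_BACKOFF_MS * 2 ** (consecutiveErrors - 1),
    MAX_BACKOFF_MS,
  );
  // min(backoff, next_cron_fire): whichever comes first wins.
  return Math.min(nowMs + backoff, nextCronFireMs);
}
```

With one error the retry lands at the short backoff; after many errors the grown backoff is capped by the next scheduled cron fire, so the job rejoins its cadence instead of drifting.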