fix: harden scheduler reliability, validation, and observability#51
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2918ad9809
src/scheduler/service.ts (Outdated)

```ts
const delay = Math.max(0, new Date(row.next).getTime() - Date.now());
this.timer = setTimeout(() => this.onTimer(), delay);
```
Reintroduce max timeout clamp before scheduling next wake-up
Using the raw delay in `setTimeout` regresses scheduling for jobs whose next fire is more than ~24.8 days away: both Node and Bun overflow delays above 2^31 − 1 ms and coerce them to ~1 ms, so the scheduler enters a hot loop (`onTimer`/DB query/re-arm) until the wall clock catches up. This can peg the CPU and spam logs whenever an `at` job or long-interval job is far in the future. The previous clamp avoided this overflow by waking periodically and recalculating.
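A minimal sketch of the clamp the review asks to reinstate, assuming a recursive re-arm pattern (the names `scheduleAt`, `clampDelay`, and `MAX_TIMER_MS` are illustrative, not the PR's actual code):

```typescript
// Illustrative clamp (not the PR's code): cap any single setTimeout at
// the 32-bit signed limit and re-arm early, so delays beyond ~24.8 days
// never overflow and collapse to ~1 ms.
const MAX_TIMER_MS = 2 ** 31 - 1; // 2147483647 ms ≈ 24.8 days

// Pure helper so the clamp is testable in isolation.
function clampDelay(remainingMs: number): number {
  return Math.min(Math.max(0, remainingMs), MAX_TIMER_MS);
}

function scheduleAt(
  targetMs: number,
  onFire: () => void,
): ReturnType<typeof setTimeout> {
  const remaining = targetMs - Date.now();
  if (remaining > MAX_TIMER_MS) {
    // Too far out for one timer: wake at the clamp and recalculate.
    return setTimeout(() => scheduleAt(targetMs, onFire), MAX_TIMER_MS);
  }
  return setTimeout(onFire, clampDelay(remaining));
}
```

Waking at the clamp and recomputing the remaining delay keeps the timer cheap (one wake-up per ~24.8 days) while never handing `setTimeout` an overflowing value.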
Summary
- `phantom_schedule` now rejects invalid cron expressions, unknown timezones, past `at` timestamps, and invalid delivery targets at creation time with descriptive errors. Rows are never inserted with `next_run_at = NULL`, and the tool surfaces `isError: true` so the agent can react.
- New `last_delivery_status` column. Every delivery attempt is recorded with a concrete outcome (`delivered`, `dropped:<reason>`, `error:<reason>`). The Slack layer catches errors internally and returns `null`, so the scheduler now checks for that explicitly: a real Slack outage surfaces as `error:slack_returned_null` instead of being stamped delivered. The existing try/catch remains as a belt-and-braces guard.
- The executor checks `runtime.isSessionBusy()` before calling `handleMessage` and skips the fire without advancing cadence or incrementing `consecutive_errors`. Direct callers of `handleMessage` outside the scheduler now see `Error: session busy` instead of a silent bounce string that used to be scored as success and delivered to Slack as the result.
- `next_run_at` rewrites inside `start()` are staggered rather than sequentially awaited. A Phantom coming back from downtime no longer blocks boot on its missed schedule backlog.
- `/health` includes a new `scheduler` summary with `total`, `active`, `paused`, `completed`, `failed`, `nextFireAt`, and `recentFailures`.
- Failure backoff is clamped to `min(backoff, next_cron_fire)` so transient errors cannot drift a job permanently off its schedule.
- A `phantom_claude` Docker volume mounts at `/home/phantom/.claude`, so `claude login` credentials survive image rebuilds and container recreates. Subscription users no longer have to re-authenticate after every deployment.
- `src/scheduler/service.ts` was split by responsibility into `executor.ts`, `delivery.ts`, `recovery.ts`, `health.ts`, `create-validation.ts`, and `row-mapper.ts`. Every scheduler file is now under 300 lines.

Additional hardening:
- `runJobNow` respects the executing guard and rejects non-active status
- corrupt rows in `listJobs`/`getJob`/`onTimer` are logged and skipped rather than bricking the list
- duplicate job names rejected at creation
- 32 KB task size limit
- `MAX_JOBS=1000` rate limit on creation
- cron expressions pinned to 5-field format with explicit nickname rejection
- `MAX_TIMER_MS` clamp removed so the timer sleeps until the actual next fire
- 30-day cleanup sweep for terminal rows
- runtime status guard refuses to write an out-of-enum `status` value
- the `zod-schema` action field gets a rich multi-line `.describe()`

Every change is additive for existing deployments. No `phantom.yaml` or `.env` changes required. Deployments that do not currently set a `provider:` block continue to work byte-for-byte unchanged.

Test plan
- `bun test`: 930 pass, 0 fail (baseline 875, +55 new tests)
- `bun run typecheck` clean
- `bun run lint` clean
- every scheduler file under 300 lines (`service.ts` at 269)
- creation validation rejects past `at`, invalid delivery targets, duplicate names, 32 KB task size, and the job count limit
- `deliverResult` covered for every outcome on every target shape, including the Slack null-return contract and the thrown-error belt path
- `executeJob` Slack outage scenarios (null return and thrown error) do not kill subsequent jobs and are correctly recorded in `last_delivery_status`
- `staggerMissedJobs` asserts `start()` returns in milliseconds even with 50 missed jobs
- `runJobNow` rejects a concurrent call and rejects non-active job status
- `listJobs` and `getJob` survive a corrupt row alongside good rows
- `/health` scheduler summary returns all seven fields
- cron nicknames rejected (`@daily`, `@yearly`, etc.)
- `min(backoff, next_cron_fire)` on the failure path for cron jobs
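The failure-path clamp exercised by the last test can be sketched like this (the constants and the `nextRunAfterFailure` name are hypothetical, not the PR's actual implementation):

```typescript
// Hypothetical sketch of the failure-path rule: exponential backoff on
// consecutive errors, but never later than the job's next regular cron
// fire, so a run of transient errors cannot drift a cron job off its
// schedule.
const BASE_BACKOFF_MS = 30_000; // illustrative base retry delay
const MAX_BACKOFF_MS = 60 * 60 * 1000; // illustrative backoff ceiling

function nextRunAfterFailure(
  nowMs: number,
  consecutiveErrors: number,
  nextCronFireMs: number,
): number {
  const backoff = Math.min(
    BASE_BACKOFF_MS * 2 ** (consecutiveErrors - 1),
    MAX_BACKOFF_MS,
  );
  // min(backoff, next_cron_fire): whichever comes first wins.
  return Math.min(nowMs + backoff, nextCronFireMs);
}
```

With one error the retry lands at the short backoff; after many errors the grown backoff is capped by the next scheduled cron fire, so the job rejoins its cadence instead of drifting.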