[codex] gate scorer registration on readiness #6

Draft
V-SK wants to merge 1 commit into main from codex/scorer-ready-gating

V-SK (Owner) commented Apr 13, 2026

What changed

This PR hardens the public scorer worker so it does not register itself as available before it is actually ready to score.

Changes in scorer/scoring_server.py:

  • add explicit runtime state fields (state, state_detail, ready, last_ready_at)
  • add startup baseline warmup before the worker is considered ready
  • add GET /status and GET /ready
  • return 503 worker_not_ready from /score and /validate until warmup completes
  • delay periodic endpoint registration until the local /ready probe returns 200
  • re-run baseline warmup after reload and successful auto-update paths
  • mark the worker degraded instead of silently advertising readiness when warmup fails
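
To make the state handling above concrete, here is a minimal sketch of how the runtime state and the 503 guard could fit together. It is not the code in scorer/scoring_server.py: the field names and the `degraded` state come from the list above, while `WorkerState`, `run_warmup`, `guard_scoring_request`, and the `starting`/`warming`/`ready` state values are hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class WorkerState:
    # Field names follow the PR description; the state values are assumptions.
    state: str = "starting"
    state_detail: str = ""
    ready: bool = False
    last_ready_at: Optional[float] = None


def run_warmup(ws: WorkerState, warmup: Callable[[], None]) -> None:
    """Run the baseline warmup and record the outcome on the state object.

    Intended to be called at startup and again after a reload or auto-update.
    """
    ws.state, ws.state_detail, ws.ready = "warming", "baseline warmup", False
    try:
        warmup()
    except Exception as exc:
        # A failed warmup is surfaced as degraded instead of ready.
        ws.state, ws.state_detail = "degraded", f"warmup failed: {exc}"
        return
    ws.state, ws.state_detail, ws.ready = "ready", "", True
    ws.last_ready_at = time.time()


def guard_scoring_request(ws: WorkerState) -> Optional[Tuple[int, dict]]:
    """Return a 503 response for /score and /validate until the worker is ready.

    Returns None when the request may proceed.
    """
    if not ws.ready:
        return 503, {"error": "worker_not_ready", "state": ws.state}
    return None
```

In this sketch a handler would call `guard_scoring_request` first and return its result if it is not `None`. Re-running `run_warmup` drops `ready` back to `False` for the duration of the warmup, which matches the reload and auto-update behavior described above.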

Why this changed

The live scoring path suffered from a cold-start window in which the scorer endpoint could be registered with the PS/Aggregator while the process was still loading the model or warming caches. That led to avoidable connection failures, bad-gateway responses, and deferred scoring candidates stuck behind a scorer that looked registered but was not actually ready.

The fix is to make readiness explicit and only register the scorer after the worker can answer real requests.
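
One way to picture the gating is the sketch below, which registers only after a local readiness probe reports 200. It is an illustration under assumptions, not the PR's implementation: `register_when_ready`, `probe_ready`, `register`, and the polling parameters are hypothetical, and the probe and registration calls are injected as plain callables so the logic stays independent of any HTTP client.

```python
import time
from typing import Callable


def register_when_ready(
    probe_ready: Callable[[], int],
    register: Callable[[], None],
    poll_interval: float = 1.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call register() only after probe_ready() returns HTTP 200.

    Returns True if registration happened, False if the worker never
    became ready within max_attempts probes.
    """
    for _ in range(max_attempts):
        try:
            status = probe_ready()
        except OSError:
            # Connection refused or timed out: the local server is not up yet.
            status = None
        if status == 200:
            register()
            return True
        sleep(poll_interval)
    return False
```

Here `probe_ready` would issue a GET against the worker's own /ready endpoint and return the status code. A worker that stays degraded never reaches `register()`, so it is never advertised.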

Impact

  • scorers now stay out of rotation until startup warmup has completed
  • first real score avoids the worst cold-cache penalty because baseline validation is prepaid at startup
  • control-plane consumers can distinguish alive from ready
  • failed warmup becomes visible as degraded/not ready instead of causing false-positive registration

Root cause

The old worker exposed only /health and registered its public endpoint immediately at startup. That conflated process liveness with scoring readiness.
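
The distinction between liveness and readiness can be shown with a small standard-library server. This is a sketch under assumptions rather than the worker's actual handler: the response bodies, the `READY` flag, and the `ProbeHandler` and `start_server` names are hypothetical. Only the /health and /ready paths and the 200/503 semantics come from this PR.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical flag; set once warmup has finished successfully.
READY = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    """Serves a liveness probe and a separate readiness probe."""

    def _send(self, code, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/health":
            # Liveness: the process is up and can answer HTTP at all.
            self._send(200, {"alive": True})
        elif self.path == "/ready":
            # Readiness: 200 only once warmup has completed.
            if READY.is_set():
                self._send(200, {"ready": True})
            else:
                self._send(503, {"ready": False})
        else:
            self._send(404, {"error": "not_found"})

    def log_message(self, format, *args):
        # Silence per-request logging in this sketch.
        pass


def start_server(port=0):
    """Start the probe server on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

During the cold-start window, /health already answers 200 while /ready still answers 503. A registration step that relies only on /health cannot see that gap.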

Validation

  • python3 -m py_compile scorer/scoring_server.py
  • live audit on the production scorer/aggregator pair after deploying the same readiness design:
    • scorer /ready returned 200
    • aggregator observed scorer_ready=true
    • deferred scoring candidate cleared
    • scorer evaluation count advanced successfully
