[codex] gate scorer registration on readiness #6
Draft
V-SK wants to merge 1 commit into
What changed
This PR hardens the public scorer worker so it does not register itself as available before it is actually ready to score.
Changes in `scorer/scoring_server.py`:
- explicit worker state fields (`state`, `state_detail`, `ready`, `last_ready_at`)
- `GET /status` and `GET /ready` endpoints
- `503 worker_not_ready` from `/score` and `/validate` until warmup completes
- registration gated on the `/ready` probe returning `200`
- `degraded` state instead of silently advertising readiness when warmup fails
Why this changed
The live scoring path was suffering from a cold-start window where the scorer endpoint could be registered with the PS/Aggregator while the process was still loading the model or warming caches. That led to avoidable connection failures, bad-gateway behavior, and deferred scoring candidates getting stuck behind a scorer that looked registered but was not actually ready.
The fix is to make readiness explicit and only register the scorer after the worker can answer real requests.
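The gating described above can be sketched roughly as follows. This is a framework-free illustration, not the code in `scorer/scoring_server.py`: the field names come from this PR, but the state values (`starting`, `ready`, `degraded`), the `503` returned by `/ready` before warmup completes, and the helper names `run_warmup`, `route`, and `register_if_ready` are assumptions made for the example.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple


@dataclass
class WorkerState:
    """Readiness bookkeeping using the field names listed in this PR."""
    state: str = "starting"  # assumed values: starting, ready, degraded
    state_detail: str = "warmup pending"
    ready: bool = False
    last_ready_at: Optional[float] = None
    _lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )

    def mark_ready(self) -> None:
        with self._lock:
            self.state, self.state_detail = "ready", "warmup complete"
            self.ready = True
            self.last_ready_at = time.time()

    def mark_degraded(self, detail: str) -> None:
        with self._lock:
            self.state, self.state_detail = "degraded", detail
            self.ready = False

    def snapshot(self) -> dict:
        with self._lock:
            return {
                "state": self.state,
                "state_detail": self.state_detail,
                "ready": self.ready,
                "last_ready_at": self.last_ready_at,
            }


def run_warmup(worker: WorkerState, warmup: Callable[[], None]) -> None:
    """Run warmup once; a failure leaves the worker degraded, never ready."""
    try:
        warmup()
    except Exception as exc:
        worker.mark_degraded(f"warmup failed: {exc}")
    else:
        worker.mark_ready()


def route(worker: WorkerState, path: str) -> Tuple[int, dict]:
    """Map a request path to (status code, JSON-like body)."""
    snap = worker.snapshot()
    if path == "/health":  # liveness only: the process is up
        return 200, {"alive": True}
    if path == "/status":
        return 200, snap
    if path == "/ready":  # assumption: 503 until warmup has succeeded
        return (200 if snap["ready"] else 503), snap
    if path in ("/score", "/validate"):
        if not snap["ready"]:
            return 503, {"error": "worker_not_ready", "state": snap["state"]}
        return 200, {"ok": True}  # placeholder for the real handler
    return 404, {"error": "not_found"}


def register_if_ready(worker: WorkerState, register: Callable[[], None]) -> bool:
    """Call the (hypothetical) registration hook only after /ready is 200."""
    status, _ = route(worker, "/ready")
    if status != 200:
        return False
    register()
    return True
```

Keeping `/health` unconditional while `/ready` depends on warmup is what separates liveness from readiness here. A caller that retries `register_if_ready` never advertises the endpoint during the cold-start window.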
Impact
- separates `alive` from `ready`
- a worker that is not ready reports `degraded` / `not ready` instead of causing false-positive registration
Root cause
The old worker exposed only `/health` and registered its public endpoint immediately at startup. That conflated process liveness with scoring readiness.
Validation
- `python3 -m py_compile scorer/scoring_server.py`
- `/ready` returned `200`
- `scorer_ready=true`
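A readiness check like the one listed above could be scripted along these lines. This is a sketch under assumptions rather than the procedure actually used for this PR: the helper names, retry count, and delay are hypothetical, and only the `/ready` path and the `200` success code come from the description.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple


def http_get(url: str, timeout: float = 2.0) -> Tuple[int, bytes]:
    """Return (status, body); status 0 means the connection itself failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:  # non-2xx responses such as 503
        return err.code, err.read()
    except OSError:  # refused connection, DNS failure, timeout
        return 0, b""


def wait_until_ready(
    base_url: str,
    attempts: int = 30,
    delay: float = 1.0,
    get: Callable[[str], Tuple[int, bytes]] = http_get,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll <base_url>/ready until it answers 200 or attempts run out."""
    url = base_url.rstrip("/") + "/ready"
    for _ in range(attempts):
        status, _body = get(url)
        if status == 200:
            return True
        sleep(delay)
    return False
```

Calling `wait_until_ready` with the worker's base URL returns `True` once the probe succeeds. The `get` and `sleep` parameters exist so the loop can be exercised without a running server.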