fix: prevent process hang when vLLM engine dies during async generation#1691
Open
dinhxuanvu wants to merge 1 commit into
Open
fix: prevent process hang when vLLM engine dies during async generation#1691dinhxuanvu wants to merge 1 commit into
dinhxuanvu wants to merge 1 commit into
Conversation
Contributor
There was a problem hiding this comment.
Code Review
This pull request modifies the exception handling in the _run_generate_for_a_group_loop method of the fully asynchronous trainer, replacing runtime errors with error logging. Review feedback highlights two critical issues: first, the failure to release capacity slots when a worker is cancelled, which leads to state inconsistency and assertion failures; and second, the use of sys.exit(1) within an asyncio task, which does not terminate the process and causes the training loop to hang. Suggestions were provided to use os._exit(1) and to ensure proper slot management during cancellation.
6648859 to
0d84986
Compare
The exception handler in _run_generate_for_a_group_loop had two bugs: 1. `raise RuntimeError()` inside asyncio.create_task() is silently swallowed — exceptions in tasks are only raised if you await them. This prevented sys.exit(1) from ever being reached. 2. sys.exit(1) itself raises SystemExit which is also captured by the asyncio task machinery. Use os._exit(1) to force-terminate. Additionally: - Release capacity slot on CancelledError via on_rollout_rejected() to prevent assertion failures in validate_state_at_epoch_end. - Reset slot_acquired after on_rollout_accepted() to avoid stale state across loop iterations. Without this fix, a single EngineDeadError causes all GPUs to idle indefinitely while the process hangs in the asyncio event loop.
3cf34ad to
8558837
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
raise RuntimeError()insideasyncio.create_task()being silently swallowedos._exit(1)instead ofsys.exit(1)to force-terminate from async task contextCancelledErrorto prevent assertion failures invalidate_state_at_epoch_endslot_acquiredafteron_rollout_accepted()to avoid stale state across loop iterationsProblem
_run_generate_for_a_group_loopruns insideasyncio.create_task(). When vLLM'sEngineDeadErroroccurs during generation,slot_acquiredis True, so the handler executes:This exception is silently swallowed by asyncio (task exceptions only surface if you
awaitthe task). Thesys.exit(1)on the next line is never reached becauseraiseexits the except block.Even if
sys.exit(1)were reached, it raisesSystemExitwhich is also captured by the asyncio task machinery and does not terminate the process.The training process hangs indefinitely with all GPUs idle.
Fix
raise RuntimeError()guards that prevented process terminationos._exit(1)which bypasses asyncio and forces immediate terminationCancelledError, callon_rollout_rejected()to release the capacity slotslot_acquired = Falseafteron_rollout_accepted()for correct state trackingImpact
Without this fix, a single
EngineDeadErrorcauses all training GPUs to sit idle indefinitely. Observed in production: 128 H100 GPUs were idle for 5+ days due to this bug.Test plan