Skip to content

fix: prevent process hang when vLLM engine dies during async generation#1691

Open
dinhxuanvu wants to merge 1 commit into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/fix-engine-dead-hang
Open

fix: prevent process hang when vLLM engine dies during async generation#1691
dinhxuanvu wants to merge 1 commit into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/fix-engine-dead-hang

Conversation

@dinhxuanvu
Copy link
Copy Markdown

@dinhxuanvu dinhxuanvu commented May 19, 2026

Summary

  • Fix process hang caused by raise RuntimeError() inside asyncio.create_task() being silently swallowed
  • Use os._exit(1) instead of sys.exit(1) to force-terminate from async task context
  • Release capacity slot on CancelledError to prevent assertion failures in validate_state_at_epoch_end
  • Reset slot_acquired after on_rollout_accepted() to avoid stale state across loop iterations

Problem

_run_generate_for_a_group_loop runs inside asyncio.create_task(). When vLLM's EngineDeadError occurs during generation, slot_acquired is True, so the handler executes:

raise RuntimeError("Generation workers should only run into error when they finish running.")

This exception is silently swallowed by asyncio (task exceptions only surface if you await the task). The sys.exit(1) on the next line is never reached because raise exits the except block.

Even if sys.exit(1) were reached, it raises SystemExit which is also captured by the asyncio task machinery and does not terminate the process.

The training process hangs indefinitely with all GPUs idle.

Fix

  1. Remove the raise RuntimeError() guards that prevented process termination
  2. Use os._exit(1) which bypasses asyncio and forces immediate termination
  3. On CancelledError, call on_rollout_rejected() to release the capacity slot
  4. Reset slot_acquired = False after on_rollout_accepted() for correct state tracking

Impact

Without this fix, a single EngineDeadError causes all training GPUs to sit idle indefinitely. Observed in production: 128 H100 GPUs were idle for 5+ days due to this bug.

Test plan

  • Verified the diff is minimal (6 insertions, 10 deletions) and only touches exception handling logic
  • Existing tests pass (no behavioral change in the happy path)
  • Integration test: kill a vLLM engine during generation and confirm the process exits instead of hanging

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the exception handling in the _run_generate_for_a_group_loop method of the fully asynchronous trainer, replacing runtime errors with error logging. Review feedback highlights two critical issues: first, the failure to release capacity slots when a worker is cancelled, which leads to state inconsistency and assertion failures; and second, the use of sys.exit(1) within an asyncio task, which does not terminate the process and causes the training loop to hang. Suggestions were provided to use os._exit(1) and to ensure proper slot management during cancellation.

Comment thread skyrl/train/fully_async_trainer.py
Comment thread skyrl/train/fully_async_trainer.py Outdated
@dinhxuanvu dinhxuanvu force-pushed the vdinh/fix-engine-dead-hang branch from 6648859 to 0d84986 Compare May 19, 2026 18:12
The exception handler in _run_generate_for_a_group_loop had two bugs:

1. `raise RuntimeError()` inside asyncio.create_task() is silently
   swallowed — exceptions in tasks are only raised if you await them.
   This prevented sys.exit(1) from ever being reached.

2. sys.exit(1) itself raises SystemExit which is also captured by
   the asyncio task machinery. Use os._exit(1) to force-terminate.

Additionally:
- Release capacity slot on CancelledError via on_rollout_rejected()
  to prevent assertion failures in validate_state_at_epoch_end.
- Reset slot_acquired after on_rollout_accepted() to avoid stale
  state across loop iterations.

Without this fix, a single EngineDeadError causes all GPUs to idle
indefinitely while the process hangs in the asyncio event loop.
@dinhxuanvu dinhxuanvu force-pushed the vdinh/fix-engine-dead-hang branch from 3cf34ad to 8558837 Compare May 20, 2026 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant