With SkyRL v0.2.0, I had a training run that:
- Completed a successful 42-minute generation phase. All rollouts finished and produced valuable information (e.g. reward signal).
- Then hit an OOM (`torch.OutOfMemoryError`) during the reference model forward pass in `RayPPOTrainer.fwd_logprobs_values_reward`: https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/train/trainer.py#L1006
Due to the OOM, MLflow never received any metrics, even though many were already present in `RayPPOTrainer.all_metrics`:
- Reward metrics: `reward/avg_raw_reward`, `reward/mean_positive_reward`, etc.
- Timing metrics: `timing/generate`, `timing/convert_to_training_input`, etc.
- Many metrics in `generator_output["rollout_metrics"]`
This request is for SkyRL to log more defensively, so that a crash does not cause total information loss. Concretely, could SkyRL add a `finally` branch that calls `tracker.log` when there is an unhandled exception, so that metrics are preserved as the system goes down?
- Wrap this in a `try`: https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/train/trainer.py#L215
- Put `tracker.log` in a `finally`, so it runs on both normal exit and teardown: https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/train/trainer.py#L338
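
To make the proposal concrete, here is a minimal, self-contained sketch of the pattern I have in mind. It is not SkyRL code: `InMemoryTracker`, `SketchTrainer`, and their methods are hypothetical stand-ins, the `log(metrics, step=...)` signature is my assumption about what the tracker accepts, and the OOM is simulated with the built-in `MemoryError` so the example runs without torch.

```python
import logging

logger = logging.getLogger(__name__)


class InMemoryTracker:
    """Hypothetical stand-in for the real tracker; records every log call."""

    def __init__(self):
        self.logged = []

    def log(self, metrics, step):
        # Copy the dict so later mutation by the trainer does not alter the record.
        self.logged.append((step, dict(metrics)))


class SketchTrainer:
    """Hypothetical trainer showing only the try/finally flush pattern."""

    def __init__(self, tracker):
        self.tracker = tracker
        self.all_metrics = {}
        self.global_step = 0

    def generate(self):
        # Stand-in for the long generation phase that already produced metrics.
        self.all_metrics["reward/avg_raw_reward"] = 0.5
        self.all_metrics["timing/generate"] = 2520.0  # 42 minutes, in seconds

    def forward_reference_model(self):
        # Stand-in for the forward pass that ran out of memory.
        raise MemoryError("simulated OOM")

    def train_step(self):
        try:
            self.generate()
            self.forward_reference_model()
        finally:
            # Runs on normal exit and while an exception is propagating.
            if self.all_metrics:
                try:
                    self.tracker.log(self.all_metrics, step=self.global_step)
                except Exception:
                    # A failing flush should not mask the original error.
                    logger.exception("Failed to flush metrics during teardown")


def demo():
    tracker = InMemoryTracker()
    trainer = SketchTrainer(tracker)
    try:
        trainer.train_step()
    except MemoryError:
        pass  # The crash still propagates; the metrics were flushed first.
    return tracker.logged


if __name__ == "__main__":
    print(demo())
```

The inner `try` around the flush is a design choice on my part: if the tracker itself fails during teardown, the original OOM should still be the exception that surfaces.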