Recorded generation metrics unnecessarily lost when `RayPPOTrainer.fwd_logprobs_values_reward` raises #1687

With SkyRL v0.2.0, I had a training run that:
1. Completed a successful 42-minute generation phase, finishing all rollouts and collecting valuable information (e.g. the reward signal).
2. Then hit an OOM (`torch.OutOfMemoryError`) during the reference model forward pass in `RayPPOTrainer.fwd_logprobs_values_reward`: https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/train/trainer.py#L1006

Because of the OOM, MLflow never received any metrics for that step, even though many were already present in `RayPPOTrainer.all_metrics`:
- Reward metrics: `reward/avg_raw_reward`, `reward/mean_positive_reward`, etc.
- Timing metrics: `timing/generate`, `timing/convert_to_training_input`, etc.
- Many metrics in `generator_output["rollout_metrics"]`

This is a request for SkyRL to log more defensively so that a crash does not cause total information loss. Concretely, can SkyRL call `tracker.log` from a `finally` block, so that metrics already collected are preserved when an unhandled exception takes the run down?
1. Wrap this in a `try`: https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/train/trainer.py#L215
2. Move `tracker.log` into the matching `finally`, so it runs on both normal exit and teardown: https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/train/trainer.py#L338
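
To make the proposal concrete, here is a minimal, self-contained sketch of the pattern. It is not SkyRL's actual code: `InMemoryTracker`, `run_step`, `generate`, and `forward` are hypothetical stand-ins, the `tracker.log(metrics, step=...)` signature is an assumption, and `MemoryError` stands in for `torch.OutOfMemoryError` so the example runs without PyTorch.

```python
import logging
from typing import Any, Callable, Dict, List, Tuple

logger = logging.getLogger(__name__)


class InMemoryTracker:
    """Hypothetical stand-in for the experiment tracker (e.g. an MLflow wrapper)."""

    def __init__(self) -> None:
        self.records: List[Tuple[int, Dict[str, Any]]] = []

    def log(self, metrics: Dict[str, Any], step: int) -> None:
        # Assumed signature; a real tracker would send these to its backend.
        self.records.append((step, dict(metrics)))


def run_step(
    step: int,
    generate: Callable[[], Dict[str, Any]],
    forward: Callable[[], Dict[str, Any]],
    tracker: InMemoryTracker,
) -> None:
    """Run one training step, flushing whatever metrics exist even on failure."""
    all_metrics: Dict[str, Any] = {}
    try:
        all_metrics.update(generate())  # e.g. the long rollout phase
        all_metrics.update(forward())   # e.g. the forward pass that may OOM
    finally:
        # Runs on normal exit and while an exception propagates.
        if all_metrics:
            try:
                tracker.log(all_metrics, step=step)
            except Exception:
                # Do not let a logging failure replace the original exception.
                logger.exception("Failed to flush metrics for step %d", step)


def fake_generate() -> Dict[str, Any]:
    return {"reward/avg_raw_reward": 0.5, "timing/generate": 2520.0}


def fake_forward() -> Dict[str, Any]:
    raise MemoryError("simulated OOM in reference model forward pass")


tracker = InMemoryTracker()
try:
    run_step(3, fake_generate, fake_forward, tracker)
except MemoryError:
    pass  # the original error still propagates to the caller

# The generation metrics were logged despite the failure.
print(tracker.records)
```

One design choice worth discussing: the sketch swallows errors from `tracker.log` itself so they cannot mask the original OOM. On the normal (non-failing) path, SkyRL may prefer to let such errors surface instead.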
