Recorded generation metrics unnecessarily lost when RayPPOTrainer.fwd_logprobs_values_reward raises #1687

@jamesbraza

Description

With SkyRL v0.2.0, I had a training run that went as follows:

  1. A successful generation phase that took 42 minutes. It completed all rollouts and collected valuable information (e.g. reward signal).
  2. An OOM (torch.OutOfMemoryError) during the reference model forward pass in RayPPOTrainer.fwd_logprobs_values_reward: https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/train/trainer.py#L1006

Due to the OOM, MLflow never received any metrics, even though many were already present in RayPPOTrainer.all_metrics:

  • Reward metrics: reward/avg_raw_reward, reward/mean_positive_reward, etc.
  • Timing metrics: timing/generate, timing/convert_to_training_input, etc.
  • Many metrics in generator_output["rollout_metrics"]

This is a request for SkyRL to log more defensively, so that a late failure does not cause total information loss. Concretely, could SkyRL add a finally branch that calls tracker.log, so that metrics already collected are preserved when an unhandled exception takes the system down? The change would be:

  1. Put this in a try: https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/train/trainer.py#L215
  2. Put tracker.log in a finally, so it runs on both normal exit and teardown after an exception: https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/train/trainer.py#L338
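
To make the two steps above concrete, here is a minimal, self-contained sketch of the pattern. It is not SkyRL code: RecordingTracker, generate, failing_forward_pass, and train_step are hypothetical stand-ins, the tracker.log(metrics, step=...) signature is an assumption, and MemoryError stands in for torch.OutOfMemoryError so the example runs without torch.

```python
from typing import Any, Callable, Dict, List


class RecordingTracker:
    """Hypothetical stand-in for the trainer's tracker; it only records calls."""

    def __init__(self) -> None:
        self.logged: List[Dict[str, Any]] = []

    def log(self, metrics: Dict[str, Any], step: int) -> None:
        # Copy the dict so later mutation does not change what was "sent".
        self.logged.append({"step": step, "metrics": dict(metrics)})


def generate(all_metrics: Dict[str, Any]) -> None:
    """Pretend generation phase: succeeds and records metrics."""
    all_metrics["reward/avg_raw_reward"] = 0.5
    all_metrics["timing/generate"] = 2520.0  # 42 minutes, in seconds


def failing_forward_pass(all_metrics: Dict[str, Any]) -> None:
    """Pretend reference-model forward pass that runs out of memory."""
    raise MemoryError("simulated OOM in the reference model forward pass")


def train_step(
    tracker: RecordingTracker,
    step: int,
    forward_pass: Callable[[Dict[str, Any]], None],
) -> None:
    all_metrics: Dict[str, Any] = {}
    try:
        generate(all_metrics)
        forward_pass(all_metrics)
    finally:
        # Runs on normal exit and while an exception propagates, so
        # whatever was collected before the failure still gets logged.
        if all_metrics:
            tracker.log(all_metrics, step=step)


if __name__ == "__main__":
    tracker = RecordingTracker()
    try:
        train_step(tracker, step=1, forward_pass=failing_forward_pass)
    except MemoryError:
        pass  # the original error still reaches the caller
    print(tracker.logged)
```

One caveat with this design: if the logging call inside the finally itself raises, that new exception replaces the original one (the original remains attached as context), so a real implementation may want to guard the tracker.log call.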
