`FSDPRefWorkerBase` loads in pure bf16 while policy/critic use mixed precision, causing KL NaN on long sequences #1694

In SkyRL v0.2.0's FSDP backend, the workers disagree on how `bf16` is set:
- [`HFModelWrapper.__init__`](https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/backends/skyrl_train/workers/model_wrapper.py#L63) and [`TrainerConfig`](https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/train/config/config.py#L610): default `bf16` to `True`
- [Policy `init_model`](https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/backends/skyrl_train/workers/fsdp/fsdp_worker.py#L190): hardcodes `bf16=False`
- [Critic `init_model`](https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/backends/skyrl_train/workers/fsdp/fsdp_worker.py#L368): hardcodes `bf16=False`
- [Ref `init_model`](https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/backends/skyrl_train/workers/fsdp/fsdp_worker.py#L436): passes `self.cfg.bf16`

With the default config, this yields:
- Policy and Critic: mixed precision (weights fp32, forward pass `autocast`s to bf16)
- Ref: pure bf16 (weights bf16, forward pass `autocast`s to bf16)
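
To get a feel for what storing weights in bf16 (rather than fp32) costs, here is a minimal pure-Python sketch that rounds an fp32 value to the nearest bf16 value. `to_bf16` is a hypothetical helper written for illustration; it is not part of SkyRL or PyTorch.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    bf16 keeps the top 16 bits of the fp32 encoding: 1 sign bit,
    8 exponent bits, and only 7 stored mantissa bits.
    """
    if math.isnan(x) or math.isinf(x):
        return x
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a rounding bias so that dropping the low 16 bits rounds to nearest even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


weight = 0.1
stored = to_bf16(weight)
relative_error = abs(stored - weight) / weight
print(stored)          # 0.10009765625
print(relative_error)  # roughly 1e-3, versus roughly 1e-8 for fp32 storage
```

Each bf16-stored weight carries a relative error of up to 2^-8, so the ref model's parameters are not the same numbers the fp32 policy sees, even when both are loaded from the same checkpoint.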

The asymmetry produces no observable failure on small models (e.g. ≤14B) and short sequences (e.g. ≤16k tokens). On large dense models with long sequences, however, the ref's pure-bf16 attention overflows, and one bad key/value position then poisons every later position in the sequence, because each later position attends back to the bad one.
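
The poisoning effect can be reproduced with a toy causal attention in pure Python. This is only a sketch with scalar values and uniform attention scores; `causal_attention` is a hypothetical helper, not SkyRL or PyTorch code.

```python
import math


def causal_attention(scores, values):
    """Toy single-head causal attention over scalar values.

    scores[i][j] is the attention logit of query i on key j;
    position i may only attend to positions j <= i.
    """
    outputs = []
    for i in range(len(values)):
        row = scores[i][: i + 1]  # causal mask: drop future positions
        m = max(row)
        weights = [math.exp(s - m) for s in row]
        total = sum(weights)
        outputs.append(sum(w / total * v for w, v in zip(weights, values[: i + 1])))
    return outputs


n = 5
scores = [[0.0] * n for _ in range(n)]       # uniform attention
values = [1.0, 2.0, float("nan"), 4.0, 5.0]  # position 2 is corrupted

out = causal_attention(scores, values)
poisoned = [math.isnan(o) for o in out]
print(poisoned)  # [False, False, True, True, True]
```

Positions before the corrupted one stay finite, while every position from it onward becomes NaN, because any nonzero attention weight multiplied by NaN is NaN.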

The cause (this requires `use_kl_loss=true` and `kl_loss_coef>0`, so that the reference model is wired into the policy loss):
1. The ref's weights are bf16 while the policy's are fp32, so rounding error compounds across many layers (e.g. Qwen3-32B has 64)
2. Eventually an attention dot product saturates to ±inf
3. That turns `log_probs_base` into NaN
4. The NaN propagates into the final loss
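
The chain from a single inf to a NaN loss can be traced in a few lines of pure Python. This is a sketch under assumptions: the `log_probs - log_probs_base` KL term is a simple estimator chosen for illustration and is not necessarily the one SkyRL uses, and the logits, `policy_loss`, and `kl_loss_coef` values are made up.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax for finite inputs."""
    m = max(logits)
    shifted = [x - m for x in logits]  # inf - inf evaluates to nan
    log_z = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_z for s in shifted]


def kl_term(policy_logits, ref_logits, token):
    """Illustrative per-token KL estimate: log pi(token) - log pi_ref(token)."""
    log_probs = log_softmax(policy_logits)[token]
    log_probs_base = log_softmax(ref_logits)[token]
    return log_probs - log_probs_base


policy_logits = [2.0, 1.0, 0.5]
healthy_ref_logits = [1.8, 1.1, 0.4]
overflowed_ref_logits = [float("inf"), 1.1, 0.4]  # one saturated activation

per_token_kl = [
    kl_term(policy_logits, healthy_ref_logits, token=0),
    kl_term(policy_logits, overflowed_ref_logits, token=1),
]

policy_loss = 0.25    # made-up value
kl_loss_coef = 0.001  # made-up value
loss = policy_loss + kl_loss_coef * sum(per_token_kl) / len(per_token_kl)
print(per_token_kl[0], per_token_kl[1], loss)
```

The first token's KL term is finite, but the single saturated ref logit makes the second term NaN, and averaging it into the batch makes the whole loss NaN regardless of how small `kl_loss_coef` is.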

## Workarounds

Make the ref worker match the `bf16=False` that is hardcoded for policy and critic, using either of the following:

- Configure `trainer.bf16=false` on the launch CLI
- Patch `HFModelWrapper.__init__` globally to force `bf16=False`
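
One possible way to do the global patch is sketched below. `force_init_arg` is a hypothetical helper, and `FakeWrapper` is a stand-in used so the example runs without SkyRL installed. The commented-out import path is inferred from the file linked above and is an assumption; if the workers run in separate processes, the patch would presumably need to be applied inside each of them.

```python
import functools
import inspect


def force_init_arg(cls, name, value):
    """Wrap cls.__init__ so that argument `name` always receives `value`.

    Handles both positional and keyword call styles. Returns the original
    __init__ so the patch can be undone.
    """
    original_init = cls.__init__
    sig = inspect.signature(original_init)

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        bound = sig.bind(self, *args, **kwargs)
        bound.arguments[name] = value  # override whatever the caller passed
        original_init(*bound.args, **bound.kwargs)

    cls.__init__ = patched_init
    return original_init


class FakeWrapper:
    """Stand-in for HFModelWrapper; the real signature may differ."""

    def __init__(self, model_path, bf16=True):
        self.model_path = model_path
        self.bf16 = bf16


force_init_arg(FakeWrapper, "bf16", False)
print(FakeWrapper("some/model").bf16)             # False
print(FakeWrapper("some/model", bf16=True).bf16)  # False

# Against SkyRL itself (assumed module path, untested here):
# from skyrl.backends.skyrl_train.workers.model_wrapper import HFModelWrapper
# force_init_arg(HFModelWrapper, "bf16", False)
```

Compared with setting `trainer.bf16=false`, the patch is more intrusive, but it also covers any call site that builds the wrapper without going through the trainer config.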
