[infra] Skip extra forward pass when policy loss does not require `old_action_log_probs`

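In async RL, and more generally whenever `mini_batch_size` == `train_batch_size`, we currently run a forward pass on the old policy model to compute `old_action_log_probs` even though the policy loss may not use them. This change skips that forward pass when the policy loss does not require `old_action_log_probs`.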
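
A minimal sketch of the gating idea, not the actual implementation in this PR. The names `maybe_compute_old_log_probs`, `requires_old_action_log_probs`, and `CountingPolicy` are hypothetical and exist only for illustration; the real trainer presumably derives the flag from the configured policy loss.

```python
from typing import Callable, List, Optional, Sequence


def maybe_compute_old_log_probs(
    requires_old_action_log_probs: bool,
    forward_fn: Callable[[Sequence[int]], List[float]],
    batch: Sequence[int],
) -> Optional[List[float]]:
    """Run the old-policy forward pass only if the loss needs its output.

    `requires_old_action_log_probs` is a hypothetical flag; how it is set
    (loss type, batch-size settings, async mode) is an assumption here.
    """
    if not requires_old_action_log_probs:
        # Skip the extra forward pass entirely.
        return None
    return forward_fn(batch)


class CountingPolicy:
    """Toy stand-in for a policy model that counts its forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, batch: Sequence[int]) -> List[float]:
        self.calls += 1
        # Dummy per-sample log-probabilities.
        return [-0.5 for _ in batch]
```

With this shape, a loss that never reads `old_action_log_probs` triggers zero old-policy forward passes, while a loss that does read them behaves as before.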