Port ModernBERT's Flash Attention Implementation for Training

For our own Llama/Qwen implementation, we should follow the [ModernBERT Flash Attention](https://github.com/huggingface/transformers/blob/6b550462139655d488d4c663086a63e98713c6b9/src/transformers/models/modernbert/modular_modernbert.py#L550) code so that training also works when the attention inputs are neither fp16 nor bf16, the only two dtypes the Flash Attention kernels accept.
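As a rough sketch of the pattern to port (not a copy of the linked code): check the input dtype, cast to a supported dtype if needed, run attention, and cast the result back. The names `attention_with_dtype_guard`, `attn_fn`, and `FLASH_DTYPES` are hypothetical, and the bf16 default for `target_dtype` is an assumption to verify against the ModernBERT source.

```python
import torch

# dtypes the Flash Attention kernels accept
FLASH_DTYPES = (torch.float16, torch.bfloat16)


def attention_with_dtype_guard(q, k, v, attn_fn, target_dtype=torch.bfloat16):
    """Run attn_fn(q, k, v), casting around it when the dtype is unsupported.

    attn_fn is a hypothetical callable standing in for the real kernel call.
    The bf16 default for target_dtype is an assumption, not confirmed here.
    """
    orig_dtype = q.dtype
    if orig_dtype in FLASH_DTYPES:
        # Already supported: call the kernel directly, no casting.
        return attn_fn(q, k, v)

    # Unsupported dtype (e.g. fp32): cast down, run attention, cast back
    # so the rest of the model keeps seeing the original dtype.
    out = attn_fn(q.to(target_dtype), k.to(target_dtype), v.to(target_dtype))
    return out.to(orig_dtype)
```

In our models, `attn_fn` would be a thin wrapper around whichever Flash Attention entry point we end up calling. Passing it in as an argument keeps the dtype handling testable on machines without the kernel installed.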