Thanks for your great work!
I’ve successfully reproduced the results on Qwen3-8B using your SFT settings via LLaMA-Factory, and the performance is indeed impressive.
However, I noticed a specific parameter in the configuration: max_grad_norm is set to 2e-5. This is quite unusual for LLM fine-tuning, where values like 1.0 or 0.5 are standard. Since such a small threshold clips gradients aggressively, I’m curious about the motivation behind this choice: how does this extremely low threshold affect convergence speed and final model performance in your experiments?
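For context on why this looks extreme: a minimal sketch of the clip-by-global-norm rule (the behavior implemented by `torch.nn.utils.clip_grad_norm_`, shown here in plain Python on nested float lists for illustration). With a typical global gradient norm around 1, a threshold of 2e-5 rescales every gradient by a factor on the order of 1e-5, which is roughly equivalent to shrinking the effective learning rate by that factor. The numbers below are made up for the demo.

```python
import math

def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Clip a list of gradient 'tensors' (flat float lists) by their global L2 norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # If total_norm exceeds max_norm, scale everything down uniformly.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * clip_coef for g in grad] for grad in grads]
    return clipped, total_norm

# Hypothetical gradients with global norm sqrt(0.09 + 0.16 + 1.44) = 1.3
grads = [[0.3, -0.4], [1.2]]

clipped, norm = clip_grad_norm(grads, max_norm=2e-5)
clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))

print(norm)          # original global norm: 1.3
print(clipped_norm)  # clipped global norm: just under 2e-5
```

So with max_grad_norm=2e-5, essentially every step is clipped, and the updates become tiny and direction-only (magnitude information is discarded), which is why I’d expect it to dominate the training dynamics here.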