Thanks for your great work!
I’ve successfully reproduced the results on Qwen3-8B using your SFT settings via LLaMA-Factory, and the performance is indeed impressive.
However, I noticed a specific parameter in the configuration: max_grad_norm is set to 2e-5. This is quite unusual for LLM fine-tuning, where values like 1.0 or 0.5 are standard. Since such a small threshold clips gradients aggressively, I’m curious about the motivation behind this choice: how does this extremely low threshold affect convergence speed and final model performance in your experiments?
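For context on why this looks extreme: a minimal sketch of the clip-by-global-norm rule (the behavior implemented by `torch.nn.utils.clip_grad_norm_`, shown here in plain Python on nested float lists for illustration). With a typical global gradient norm around 1, a threshold of 2e-5 rescales every gradient by a factor on the order of 1e-5, which is roughly equivalent to shrinking the effective learning rate by that factor. The numbers below are made up for the demo.

```python
import math

def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Clip a list of gradient 'tensors' (flat float lists) by their global L2 norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # If total_norm exceeds max_norm, scale everything down uniformly.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * clip_coef for g in grad] for grad in grads]
    return clipped, total_norm

# Hypothetical gradients with global norm sqrt(0.09 + 0.16 + 1.44) = 1.3
grads = [[0.3, -0.4], [1.2]]

clipped, norm = clip_grad_norm(grads, max_norm=2e-5)
clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))

print(norm)          # original global norm: 1.3
print(clipped_norm)  # clipped global norm: just under 2e-5
```

So with max_grad_norm=2e-5, essentially every step is clipped, and the updates become tiny and direction-only (magnitude information is discarded), which is why I’d expect it to dominate the training dynamics here.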