Question about SFT setting #16

@Stephen0808

Thanks for your great work!

I’ve successfully reproduced the results on Qwen3-8B using your SFT settings via LLaMA-Factory, and the performance is indeed impressive.

However, I noticed an unusual parameter in the configuration: max_grad_norm is set to 2e-5. Values such as 1.0 or 0.5 are standard for LLM fine-tuning, and a threshold this small clips the gradients heavily. I’m curious about the intuition behind this choice: what motivated it, and how does such a low threshold affect convergence speed and final model performance in your experiments?
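
To make the concern concrete, here is a minimal sketch of global-norm gradient clipping as I understand it. The helper `clip_by_global_norm` and the gradient values are made up for illustration; this is plain Python, not code from this repository or from LLaMA-Factory. It assumes the common form where gradients are scaled by `max_norm / (total_norm + eps)` and the scale is capped at 1.0.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradient values so their global L2 norm is at most max_norm.

    Hypothetical helper for illustration only.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The coefficient is capped at 1.0, so gradients are only scaled down.
    coef = min(1.0, max_norm / (total_norm + eps))
    return [g * coef for g in grads], total_norm

grads = [0.3, -0.4, 1.2]  # made-up gradient values with L2 norm 1.3
for max_norm in (1.0, 2e-5):
    clipped, total = clip_by_global_norm(grads, max_norm)
    clipped_norm = math.sqrt(sum(g * g for g in clipped))
    print(f"max_grad_norm={max_norm:g}: norm {total:.3f} -> {clipped_norm:.3g}")
```

Under these assumptions, any step whose gradient norm exceeds 2e-5 is rescaled to roughly that norm, so the threshold fixes the gradient magnitude and only the direction carries information. How much this matters in practice likely depends on the optimizer, which is part of what I am asking about.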
