Hi, thanks for sharing this impressive work.
I have a few questions regarding the 2-bit QAT data setup and would appreciate some clarification:
1. SFT data usage in QAT
- Is the 2-bit QAT training conducted entirely on SFT data?
- When you mention “50% of SFT corpus,” does this mean 50% of the full SFT corpus?
- Could you clarify the scale of the full SFT corpus (e.g., total tokens or samples)?
2. Relationship with the 89B curated dataset
- You mentioned curating an optimized dataset of 89B tokens — is this dataset derived from the full SFT corpus, or is it a separate hybrid dataset (e.g., including pretraining or other sources)?
- Is this 89B dataset actually used during QAT, or only for earlier stages?
3. Training data scale
- The paper mentions using only 10% of the data compared to BitNet-2B. Does the 2-bit QAT training use approximately 400B tokens in total?
Thanks in advance for the clarification!