Hi, thanks for sharing this impressive work.
I have a few questions regarding the 2-bit QAT data setup and would appreciate some clarification:
1. SFT data usage in QAT
- Is the 2-bit QAT training conducted entirely on SFT data?
- When you mention “50% of SFT corpus,” does this mean 50% of the full SFT corpus?
- Could you clarify the scale of the full SFT corpus (e.g., total tokens or samples)?
2. Relationship with the 89B curated dataset
- You mentioned curating an optimized dataset of 89B tokens — is this dataset derived from the full SFT corpus, or is it a separate hybrid dataset (e.g., including pretraining or other sources)?
- Is this 89B dataset actually used during QAT, or only for earlier stages?
3. Training data scale
- The paper mentions using only 10% of the data compared to BitNet-2B. Does the 2-bit QAT training use approximately 400B tokens in total?
Thanks in advance for the clarification!