Questions about QAT Data Composition #277

@xyuany

Hi, thanks for sharing this impressive work.

I have a few questions regarding the 2-bit QAT data setup and would appreciate some clarification:

1. SFT data usage in QAT

  • Is the 2-bit QAT training conducted entirely on SFT data?

  • When you mention “50% of SFT corpus,” does this mean 50% of the full SFT corpus?

  • Could you clarify the scale of the full SFT corpus (e.g., total tokens or samples)?

2. Relationship with the 89B curated dataset

  • You mentioned curating an optimized dataset of 89B tokens — is this dataset derived from the full SFT corpus, or is it a separate hybrid dataset (e.g., including pretraining or other sources)?
  • Is this 89B dataset actually used during QAT, or only for earlier stages?

3. Training data scale

  • The paper mentions using only 10% of the data compared to BitNet-2B. Does the 2-bit QAT training use approximately 400B tokens in total?

Thanks in advance for the clarification!
