Skip to content

Record: 11L XSA6 + Warmdown3000 + QAT@0.30 (val_bpb=1.1352, 2-seed mean)#695

Open
0xNoramiya wants to merge 1 commit intoopenai:mainfrom
0xNoramiya:submission/xsa6-wd3000-qat030
Open

Record: 11L XSA6 + Warmdown3000 + QAT@0.30 (val_bpb=1.1352, 2-seed mean)#695
0xNoramiya wants to merge 1 commit intoopenai:mainfrom
0xNoramiya:submission/xsa6-wd3000-qat030

Conversation

@0xNoramiya
Copy link

11L XSA6 + Warmdown3000 + QAT@0.30 (val_bpb: 1.1352)

val_bpb: 1.1360 (sliding window stride=64, 2-seed mean) | 15.88 MB | 8xH100 SXM, 600s

Changes from SOTA (PR #414)

Three targeted hyperparameter changes identified through 37 local ablation experiments on an RTX 4060 Ti:

Change SOTA (PR #414) Ours Rationale
XSA layers last 4 last 6 More context-only attention layers
Warmdown 3500 iters 3000 iters Shorter cooldown preserves more full-LR training
Late QAT threshold 0.15 0.30 Earlier QAT gives more steps to adapt to int6

Results (2 seeds, 8xH100 SXM)

Seed Steps ms/step Sliding BPB (s64) Artifact
42 5,447 110.2 1.1352 15,883,805 bytes
1337 5,448 110.1 1.1367 15,730,868 bytes

Mean: 1.1360 | Std: 0.0008 | Submitted: seed 42 (best)

Note: ~5,400 steps (110ms/step) vs SOTA's ~7,100 (85ms/step) due to PyTorch SDPA fallback — FlashAttention 3 was unavailable in our deployment environment. With FA3, we would expect ~7,000 steps and correspondingly lower BPB.

Methodology

Hyperparameters selected via 37 ablation experiments on a single RTX 4060 Ti (500 steps each) across 8 dimensions: BigramHash buckets, EMA decay, warmdown ratio, matrix LR, gradient clip, QAT threshold, XSA layers, and Muon momentum. Key finding: EMA decay and LR are step-count-dependent (don't transfer from local to H100), while warmdown ratio and QAT threshold do transfer.

Compliance

  • 2 seeds on 8xH100 SXM, <=600s each
  • All artifacts <=16,000,000 bytes (max: 15,883,805)
  • Sliding window eval stride=64
  • No test-time training on validation data
  • No network calls during evaluation
  • Self-contained train_gpt.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant