GatedAttn + ValueResid + XSA6 + HedgeMixer + Legal TTT — val_bpb: 1.08965 (3-seed mean)#430
Closed
sahiee-dev wants to merge 21 commits into openai:main from
Conversation
Dropped QAT: the 8% throughput penalty kills the 600s budget (per PR openai#360). Three novel additions on the thwu1 SOTA base (1.1428):

- TrigramHash(20480, dim=32): trigram embedding signal; bigram table reduced 10240 -> 4096 (see the sketch after this list)
- XSA: orthogonal self-value removal, last 4 layers, from PR openai#287
- TTT: 3-epoch SGD on val tokens before eval, all ranks, ~47s budget

Fixed rank bug: TTT now runs on all 8 ranks independently (not rank 0 only). Artifact: ~15.64MB. Smoke tests passing. H100 validation pending.
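For reference, a minimal sketch of what a hashed trigram embedding along these lines could look like. The class name and sizes match the description above, but the hash constants, zero-init, and integration point are assumptions, not this PR's actual code:

```python
import torch
import torch.nn as nn

class TrigramHash(nn.Module):
    """Sketch of a hashed trigram embedding (hypothetical implementation).

    Hashes (t-2, t-1, t) token-id triples into a fixed table of
    `num_buckets` rows; the hash constants and zero-init are assumptions.
    """

    def __init__(self, num_buckets: int = 20480, dim: int = 32):
        super().__init__()
        self.num_buckets = num_buckets
        self.emb = nn.Embedding(num_buckets, dim)
        nn.init.zeros_(self.emb.weight)  # starts as a no-op signal

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (B, T) token ids. Shifted copies give the two preceding tokens;
        # the first positions fall back to the current token to keep shape (B, T).
        prev1 = torch.roll(idx, shifts=1, dims=1)
        prev2 = torch.roll(idx, shifts=2, dims=1)
        prev1[:, :1] = idx[:, :1]
        prev2[:, :2] = idx[:, :2]
        # Cheap multiplicative hash into the bucket table.
        h = (idx * 1000003 + prev1 * 10007 + prev2) % self.num_buckets
        return self.emb(h)  # (B, T, dim); added or concatenated into the token stream
```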
New addition: an EMA (decay=0.9999) shadow model; the final eval uses the EMA weights (sketch below). EMA coexists with SWA and has zero artifact cost. Consistent with PR openai#338 (best open PR, 1.1254 bpb), which also uses EMA. An 11th layer is ruled out: it needs ~0.91MB and only ~0.36MB of budget is available. Full stack on the thwu1 base (1.1428):

- TrigramHash(20480, dim=32): trigram embeddings; bigram table 10240 -> 4096
- XSA: orthogonal self-value removal, last 4 layers (PR openai#287)
- EMA: decay=0.9999, shadow model used at the final eval
- TTT: 3-epoch SGD on val tokens, all ranks, ~47s budget

Artifact: ~15.64MB. H100 validation pending.
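A hedged sketch of the EMA shadow-model update described above; only the decay value comes from this PR, the helper name and usage comments are illustrative:

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.9999) -> None:
    """Shadow-weight update run after each optimizer step (illustrative helper)."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
    for ema_b, b in zip(ema_model.buffers(), model.buffers()):
        ema_b.copy_(b)  # track non-trainable state directly

# Illustrative usage:
# ema_model = copy.deepcopy(model).eval()
# after each optimizer step:   update_ema(ema_model, model, decay=0.9999)
# at the final eval:           run validation with ema_model instead of model
```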
T4 ablation (1000 steps, 4 variants):

| Variant | Config | Loss |
|---|---|---|
| V2 | bigram=10240, no trigram | 5.4379 (winner) |
| V4 | bigram=8192 + trigram=8192 | 5.6956 |
| V3 | bigram=4096 + trigram=20480 | 5.7924 (was our submission) |
| V1 | bigram=4096, no trigram | 5.8414 |

TrigramHash adds noise, and the bigram reduction actively hurts. Restored bigram=10240. The stack is now XSA + EMA + TTT on the thwu1 base. These are proven techniques (XSA from PR openai#287, EMA + TTT from the PR openai#338 lineage) applied cleanly on the openai#1 submission.
Full stack on the thwu1 base (1.1428):

- Value Residual: lambda_v * v0 shortcut into every block, init=0 (sketch of both scalars below)
- Gated Attention: learned scalar gate on the attention output, init=1
- XSA: orthogonal self-value removal, last 4 layers
- EMA: decay=0.9999 shadow model used at the final eval
- AdamW TTT: lr=0.001, 3 epochs on val tokens before eval
- BigramHash(10240): restored to full size after the ablation

Techniques consistent with PR openai#490 (1.0891) and PR openai#486 (1.0887). Expected range: 1.08-1.10 on 8xH100s. The trigram ablation confirmed negative at small scale, so it was removed.
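As referenced in the list above, a minimal sketch of the two no-op-initialized scalars. The surrounding attention/MLP plumbing is a stand-in, and the granularity (single scalar here vs. per-head/per-block elsewhere in the thread) is simplified:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Transformer block with the gate and value-residual scalars (stand-in attention/MLP)."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Gated attention: learned scalar on the attention output, init=1 -> exact no-op at start.
        self.attn_gate = nn.Parameter(torch.ones(1))
        # Value residual: learned scalar mixing in the shared shortcut v0, init=0 -> exact no-op at start.
        self.lambda_v = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, v0: torch.Tensor) -> torch.Tensor:
        # v0 is the shortcut source shared across blocks (first-block values or
        # initial embeddings, depending on the variant described above).
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.attn_gate * attn_out + self.lambda_v * v0
        x = x + self.mlp(self.ln2(x))
        return x
```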
XSA6 + BigramHash4K + GatedAttn + ValueResidual on HedgeMixer Stack
Summary
Forked from @agalimova's PR #720 (XSA6 + BigramHash4K + HedgeMixer baseline at 1.1078).
Added two lightweight architectural modifications that improved performance by ~0.018 bpb.
Architectural Changes
- Gated Attention (`attn_gate`): a per-head learned FP32 scalar (init=1.0) multiplied against the attention output, allowing the model to learn head-specific contribution magnitudes.
- Value Residual (`lambda_v`): a per-block learned FP32 scalar (init=0.0) that injects a fraction of the initial token embedding `x0` directly into the residual stream.

Both initialize as strict no-ops and are registered in `CONTROL_TENSOR_NAME_PATTERNS` to remain in FP32 and bypass GPTQ quantization (illustrative sketch below).
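An illustrative sketch of that FP32 bypass, assuming `CONTROL_TENSOR_NAME_PATTERNS` is a list of name regexes consulted by the export path; the pattern strings, helper, and quantization hooks below are hypothetical, not the repo's actual API:

```python
import re

# Assumed shape of the control-tensor registry: regexes that the
# export/quantization path checks before deciding how to store a parameter.
CONTROL_TENSOR_NAME_PATTERNS = [
    r"\.attn_gate$",   # hypothetical pattern for the attention gates
    r"\.lambda_v$",    # hypothetical pattern for the value-residual scalars
]

def is_control_tensor(name: str) -> bool:
    """True if a parameter name should stay in FP32 and skip GPTQ."""
    return any(re.search(p, name) for p in CONTROL_TENSOR_NAME_PATTERNS)

# Illustrative export loop (keep_fp32 / quantize_gptq are placeholders):
# for name, param in model.named_parameters():
#     if is_control_tensor(name):
#         keep_fp32(name, param)
#     else:
#         quantize_gptq(name, param)
```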
Results
Artifact: 14,917,177 bytes (14.9MB). All seeds evaluated under 600s.
Evaluation
`EVAL_STRIDE=64` (matches the official baseline default)

TTT Legality
TTT follows Case 3 (legal) per issue #402: tokens are scored before any adaptation, and documents are evaluated independently with no cross-document leakage. The dependency graph is identical to standard autoregressive eval.
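A hedged sketch of that Case-3 loop, assuming a per-document chunk iterator and a `loss_fn` helper; the AdamW lr and epoch count mirror numbers mentioned earlier in the thread, everything else is illustrative:

```python
import copy
import torch

def eval_doc_with_ttt(model, doc_chunks, loss_fn, lr: float = 1e-3, epochs: int = 3):
    """Score one document with test-time training (illustrative loop).

    Starts from a fresh copy of the trained weights, so nothing leaks across
    documents, and scores every chunk *before* taking gradient steps on it,
    so each token's score depends only on earlier tokens.
    """
    m = copy.deepcopy(model)
    opt = torch.optim.AdamW(m.parameters(), lr=lr)
    chunk_losses = []
    for chunk in doc_chunks:                # chunks in document order
        with torch.no_grad():
            chunk_losses.append(loss_fn(m, chunk).item())  # score first, frozen
        for _ in range(epochs):             # then adapt on the already-scored tokens
            opt.zero_grad(set_to_none=True)
            loss_fn(m, chunk).backward()
            opt.step()
    return sum(chunk_losses) / len(chunk_losses)
```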