GatedAttn + ValueResid + XSA6 + HedgeMixer + Legal TTT — val_bpb: 1.08965 (3-seed mean)#430

Closed
sahiee-dev wants to merge 21 commits into openai:main from sahiee-dev:main

Conversation


@sahiee-dev sahiee-dev commented Mar 22, 2026

XSA6 + BigramHash4K + GatedAttn + ValueResidual on HedgeMixer Stack

Summary

Forked from @agalimova's PR #720 (XSA6 + BigramHash4K + HedgeMixer baseline at 1.1078).
Added two lightweight architectural modifications that improved performance by ~0.018 bpb.

Architectural Changes

  1. Gated Attention (attn_gate): A per-head learned FP32 scalar (init=1.0) multiplied against the attention output, allowing the model to learn head-specific contribution magnitudes.
  2. Value Residual (lambda_v): A per-block learned FP32 scalar (init=0.0) that injects a fraction of the initial token embedding x0 directly into the residual stream.

Both initialize as strict no-ops and are registered in CONTROL_TENSOR_NAME_PATTERNS so they remain in FP32 and bypass GPTQ quantization.
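A minimal PyTorch sketch of the two modifications, assuming the shapes and module boundaries below (the actual integration into the attention block is not shown in this PR, so names and wiring here are illustrative):

```python
import torch
import torch.nn as nn

class GatedAttnValueResid(nn.Module):
    """Sketch of the two no-op-initialized modifications (wiring hypothetical)."""
    def __init__(self, n_heads: int, head_dim: int):
        super().__init__()
        # Per-head FP32 gate on the attention output, init=1.0 (strict no-op).
        self.attn_gate = nn.Parameter(torch.ones(n_heads, dtype=torch.float32))
        # Per-block FP32 scalar mixing in the initial embedding x0, init=0.0.
        self.lambda_v = nn.Parameter(torch.zeros(1, dtype=torch.float32))

    def forward(self, attn_out: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, seq, n_heads, head_dim); x0: (batch, seq, d_model)
        gated = attn_out * self.attn_gate.view(1, 1, -1, 1)
        b, s, h, d = gated.shape
        # Merge heads and add the value-residual shortcut from x0.
        return gated.reshape(b, s, h * d) + self.lambda_v * x0
```

At init the gate is 1 and lambda_v is 0, so the forward pass is exactly the baseline attention output, which is what makes both changes safe to add to an already-tuned stack.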

Results

| Seed | val_bpb    | val_loss   | eval_time |
|------|------------|------------|-----------|
| 42   | 1.08778131 | 1.83666831 | 504s      |
| 1337 | 1.09024766 | 1.84083264 | 503s      |
| 2025 | 1.09090710 | 1.84194607 | 506s      |
| mean | 1.08964536 | 1.83981567 |           |

Artifact: 14,917,177 bytes (14.9MB). All seeds evaluated in under 600s.

Evaluation

  • EVAL_STRIDE=64 (matches official baseline default)
  • All runs completed in ~503–506s (under the 600s hard limit)
  • Hardware: 8×H100 SXM 80GB
  • Compression: zstd level-22

TTT Legality

TTT follows Case 3 (legal) per issue #402: tokens are scored before any adaptation, and documents are evaluated independently with no cross-document leakage. The dependency graph is identical to standard autoregressive eval.
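The Case-3 ordering can be sketched as a score-then-adapt loop. This is an illustrative reconstruction, not the PR's code: the chunking, loss interface, and optimizer factory are assumptions.

```python
import copy
import torch

def legal_ttt_eval(model, documents, make_opt, epochs=3):
    """Case-3-legal TTT sketch: within each document, a chunk is scored with
    the current weights *before* the model adapts on it, so every token's
    dependency graph matches standard autoregressive eval."""
    base = copy.deepcopy(model.state_dict())
    total_loss, total_tokens = 0.0, 0
    for chunks in documents:                 # each document: list of token chunks
        model.load_state_dict(base)          # reset weights: no cross-doc leakage
        opt = make_opt(model.parameters())
        for chunk in chunks:
            with torch.no_grad():            # 1) score first...
                total_loss += model(chunk).item() * chunk.numel()
                total_tokens += chunk.numel()
            for _ in range(epochs):          # 2) ...then adapt on the same chunk
                opt.zero_grad()
                model(chunk).backward()
                opt.step()
    return total_loss / total_tokens
```

The key invariants are the per-document weight reset and the no_grad scoring pass happening strictly before the adaptation steps on that chunk.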

Dropped QAT: its 8% throughput penalty breaks the 600s budget (per PR openai#360).

Three novel additions on thwu1 SOTA base (1.1428):
- TrigramHash(20480, dim=32): trigram embedding signal, bigram 10240->4096
- XSA: orthogonal self-value removal, last 4 layers, from PR openai#287
- TTT: 3-epoch SGD on val tokens before eval, all ranks, ~47s budget
  Fixed rank bug: TTT runs on all 8 ranks independently (not rank 0 only)
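A hashed n-gram embedding in the BigramHash/TrigramHash style can be sketched as below. The hash constant, padding scheme, and class name are illustrative assumptions, not the PR's implementation:

```python
import torch
import torch.nn as nn

class NGramHashEmbed(nn.Module):
    """Sketch of a hashed n-gram embedding table (BigramHash/TrigramHash style).
    Table sizes and the polynomial hash constant are illustrative."""
    def __init__(self, n: int, table_size: int, dim: int):
        super().__init__()
        self.n, self.table_size = n, table_size
        self.table = nn.Embedding(table_size, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) int64 token ids.
        h = tokens.clone()
        for k in range(1, self.n):
            shifted = torch.roll(tokens, shifts=k, dims=-1)
            shifted[..., :k] = 0           # zero-pad the left edge of each sequence
            h = h * 1000003 + shifted      # simple polynomial rolling hash
        return self.table(h % self.table_size)
```

Collisions in the hash table are accepted by design; the feature is a cheap positional n-gram signal added alongside the main token embedding, so a BigramHash(10240) costs only the 10240-row table.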

Artifact: ~15.64MB. Smoke tests passing. H100 validation pending.
@sahiee-dev changed the title from "TrigramHash + XSA + TTT on thwu1 SOTA stack — val_bpb pending H100" to "TrigramHash + XSA + TTT on thwu1 SOTA stack - val_bpb pending H100" Mar 22, 2026
New addition: EMA (decay=0.9999) shadow model, eval uses EMA weights.
EMA coexists with SWA. Zero artifact cost. Consistent with PR openai#338
(best open PR, 1.1254 bpb) which also uses EMA.
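An EMA shadow model of this kind can be sketched as follows; the class name and update hook are assumptions, but the decay arithmetic is the standard exponential moving average:

```python
import torch

class EMAShadow:
    """Sketch of an EMA shadow model (e.g. decay=0.9999). Training updates the
    live model; update() is called each step; eval loads the shadow weights.
    Zero artifact cost if only the shadow weights are shipped."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        self.decay = decay
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                # shadow <- decay * shadow + (1 - decay) * live
                self.shadow[k].mul_(self.decay).add_(v, alpha=1 - self.decay)
            else:
                self.shadow[k].copy_(v)    # ints/buffers tracked verbatim

    def copy_to(self, model: torch.nn.Module):
        model.load_state_dict(self.shadow)
```

Because the shadow is a weighted average of recent checkpoints, it coexists naturally with SWA-style averaging and needs no extra storage in the submitted artifact.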

11th layer ruled out: needs ~0.91MB, only ~0.36MB budget available.

Full stack on thwu1 base (1.1428):
- TrigramHash(20480, dim=32): trigram embeddings, bigram 10240->4096
- XSA: orthogonal self-value removal, last 4 layers (PR openai#287)
- EMA: decay=0.9999, shadow model used at final eval
- TTT: 3-epoch SGD on val tokens, all ranks, ~47s budget

Artifact: ~15.64MB. H100 validation pending.
T4 ablation (1000 steps, 4 variants):

| Variant | Config                     | Loss   | Note                |
|---------|----------------------------|--------|---------------------|
| V2      | bigram=10240, no trigram   | 5.4379 | winner              |
| V4      | bigram=8192 + trigram=8192 | 5.6956 |                     |
| V3      | bigram=4096 + trigram=20480| 5.7924 | was our submission  |
| V1      | bigram=4096, no trigram    | 5.8414 |                     |
TrigramHash adds noise, and shrinking the bigram table to make room for it actively hurts.
Restored bigram=10240. Stack is now: XSA + EMA + TTT on thwu1 base.
These are proven techniques (XSA from PR openai#287, EMA+TTT from PR openai#338 lineage)
applied cleanly on the openai#1 submission.
Full stack on thwu1 base (1.1428):
- Value Residual: lambda_v * v0 shortcut to every block, init=0
- Gated Attention: learned scalar gate on attn output, init=1
- XSA: orthogonal self-value removal, last 4 layers
- EMA: decay=0.9999 shadow model used at final eval
- AdamW TTT: lr=0.001, 3 epochs on val tokens before eval
- BigramHash(10240): restored to full size after ablation

Techniques consistent with PR openai#490 (1.0891) and PR openai#486 (1.0887).
Expected range: 1.08-1.10 on 8xH100s.
Trigram ablation confirmed negative at small scale — removed.
@sahiee-dev changed the title from "TrigramHash + XSA + TTT on thwu1 SOTA stack - val_bpb pending H100" to "Value Residual + Gated Attention + XSA + EMA + AdamW TTT — val_bpb pending H100" Mar 23, 2026
@sahiee-dev changed the title from "Value Residual + Gated Attention + XSA + EMA + AdamW TTT — val_bpb pending H100" to "GatedAttn + ValueResid + XSA6 + HedgeMixer + Legal TTT — val_bpb: 1.08965 (3-seed mean)" Mar 26, 2026
@sahiee-dev marked this pull request as ready for review Mar 26, 2026 06:55
@sahiee-dev marked this pull request as draft Mar 26, 2026 06:56
@sahiee-dev closed this Mar 26, 2026