GatedAttn + ValueResid + XSA6 + HedgeMixer + Legal TTT — val_bpb: 1.08965 (3-seed mean)#430
Closed
sahiee-dev wants to merge 21 commits into openai:main from
Conversation
Dropped QAT: the 8% throughput penalty kills the 600s budget (per PR openai#360). Three novel additions on the thwu1 SOTA base (1.1428):

- TrigramHash(20480, dim=32): trigram embedding signal; bigram table reduced 10240 -> 4096 (see the sketch after this list)
- XSA: orthogonal self-value removal, last 4 layers, from PR openai#287
- TTT: 3-epoch SGD on val tokens before eval, all ranks, ~47s budget

Fixed rank bug: TTT now runs on all 8 ranks independently (not rank 0 only). Artifact: ~15.64MB. Smoke tests passing. H100 validation pending.
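For reference, a minimal sketch of what a hashed trigram embedding along these lines could look like. The class name and sizes match the description above, but the hash constants, zero-init, and integration point are assumptions, not this PR's actual code:

```python
import torch
import torch.nn as nn

class TrigramHash(nn.Module):
    """Sketch of a hashed trigram embedding (hypothetical implementation).

    Hashes (t-2, t-1, t) token-id triples into a fixed table of
    `num_buckets` rows; the hash constants and zero-init are assumptions.
    """

    def __init__(self, num_buckets: int = 20480, dim: int = 32):
        super().__init__()
        self.num_buckets = num_buckets
        self.emb = nn.Embedding(num_buckets, dim)
        nn.init.zeros_(self.emb.weight)  # starts as a no-op signal

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (B, T) token ids. Shifted copies give the two preceding tokens;
        # the first positions fall back to the current token to keep shape (B, T).
        prev1 = torch.roll(idx, shifts=1, dims=1)
        prev2 = torch.roll(idx, shifts=2, dims=1)
        prev1[:, :1] = idx[:, :1]
        prev2[:, :2] = idx[:, :2]
        # Cheap multiplicative hash into the bucket table.
        h = (idx * 1000003 + prev1 * 10007 + prev2) % self.num_buckets
        return self.emb(h)  # (B, T, dim); added or concatenated into the token stream
```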
New addition: an EMA (decay=0.9999) shadow model; the final eval uses the EMA weights (sketch below). EMA coexists with SWA and has zero artifact cost. Consistent with PR openai#338 (best open PR, 1.1254 bpb), which also uses EMA. An 11th layer is ruled out: it needs ~0.91MB and only ~0.36MB of budget is available. Full stack on the thwu1 base (1.1428):

- TrigramHash(20480, dim=32): trigram embeddings; bigram table 10240 -> 4096
- XSA: orthogonal self-value removal, last 4 layers (PR openai#287)
- EMA: decay=0.9999, shadow model used at the final eval
- TTT: 3-epoch SGD on val tokens, all ranks, ~47s budget

Artifact: ~15.64MB. H100 validation pending.
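A hedged sketch of the EMA shadow-model update described above; only the decay value comes from this PR, the helper name and usage comments are illustrative:

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.9999) -> None:
    """Shadow-weight update run after each optimizer step (illustrative helper)."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
    for ema_b, b in zip(ema_model.buffers(), model.buffers()):
        ema_b.copy_(b)  # track non-trainable state directly

# Illustrative usage:
# ema_model = copy.deepcopy(model).eval()
# after each optimizer step:   update_ema(ema_model, model, decay=0.9999)
# at the final eval:           run validation with ema_model instead of model
```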
T4 ablation (1000 steps, 4 variants):

| Variant | Config | Loss |
|---|---|---|
| V2 | bigram=10240, no trigram | 5.4379 (winner) |
| V4 | bigram=8192 + trigram=8192 | 5.6956 |
| V3 | bigram=4096 + trigram=20480 | 5.7924 (was our submission) |
| V1 | bigram=4096, no trigram | 5.8414 |

TrigramHash adds noise, and the bigram reduction actively hurts. Restored bigram=10240. The stack is now XSA + EMA + TTT on the thwu1 base. These are proven techniques (XSA from PR openai#287, EMA + TTT from the PR openai#338 lineage) applied cleanly on the openai#1 submission.
Full stack on the thwu1 base (1.1428):

- Value Residual: lambda_v * v0 shortcut into every block, init=0 (sketch of both scalars below)
- Gated Attention: learned scalar gate on the attention output, init=1
- XSA: orthogonal self-value removal, last 4 layers
- EMA: decay=0.9999 shadow model used at the final eval
- AdamW TTT: lr=0.001, 3 epochs on val tokens before eval
- BigramHash(10240): restored to full size after the ablation

Techniques consistent with PR openai#490 (1.0891) and PR openai#486 (1.0887). Expected range: 1.08-1.10 on 8xH100s. The trigram ablation confirmed negative at small scale, so it was removed.
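As referenced in the list above, a minimal sketch of the two no-op-initialized scalars. The surrounding attention/MLP plumbing is a stand-in, and the granularity (single scalar here vs. per-head/per-block elsewhere in the thread) is simplified:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Transformer block with the gate and value-residual scalars (stand-in attention/MLP)."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Gated attention: learned scalar on the attention output, init=1 -> exact no-op at start.
        self.attn_gate = nn.Parameter(torch.ones(1))
        # Value residual: learned scalar mixing in the shared shortcut v0, init=0 -> exact no-op at start.
        self.lambda_v = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, v0: torch.Tensor) -> torch.Tensor:
        # v0 is the shortcut source shared across blocks (first-block values or
        # initial embeddings, depending on the variant described above).
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.attn_gate * attn_out + self.lambda_v * v0
        x = x + self.mlp(self.ln2(x))
        return x
```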
XSA6 + BigramHash4K + GatedAttn + ValueResidual on HedgeMixer Stack
Summary
Forked from @agalimova's PR #720 (XSA6 + BigramHash4K + HedgeMixer baseline at 1.1078).
Added two lightweight architectural modifications that improved performance by ~0.018 bpb.
Architectural Changes
- Gated Attention (`attn_gate`): a per-head learned FP32 scalar (init=1.0) multiplied against the attention output, allowing the model to learn head-specific contribution magnitudes.
- Value Residual (`lambda_v`): a per-block learned FP32 scalar (init=0.0) that injects a fraction of the initial token embedding `x0` directly into the residual stream.

Both initialize as strict no-ops and are registered in `CONTROL_TENSOR_NAME_PATTERNS` to remain in FP32 and bypass GPTQ quantization (illustrative sketch below).
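An illustrative sketch of that FP32 bypass, assuming `CONTROL_TENSOR_NAME_PATTERNS` is a list of name regexes consulted by the export path; the pattern strings, helper, and quantization hooks below are hypothetical, not the repo's actual API:

```python
import re

# Assumed shape of the control-tensor registry: regexes that the
# export/quantization path checks before deciding how to store a parameter.
CONTROL_TENSOR_NAME_PATTERNS = [
    r"\.attn_gate$",   # hypothetical pattern for the attention gates
    r"\.lambda_v$",    # hypothetical pattern for the value-residual scalars
]

def is_control_tensor(name: str) -> bool:
    """True if a parameter name should stay in FP32 and skip GPTQ."""
    return any(re.search(p, name) for p in CONTROL_TENSOR_NAME_PATTERNS)

# Illustrative export loop (keep_fp32 / quantize_gptq are placeholders):
# for name, param in model.named_parameters():
#     if is_control_tensor(name):
#         keep_fp32(name, param)
#     else:
#         quantize_gptq(name, param)
```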
Results
Artifact: 14,917,177 bytes (14.9MB). All seeds evaluated under 600s.
Evaluation
`EVAL_STRIDE=64` (matches the official baseline default)

TTT Legality
TTT follows Case 3 (legal) per issue #402: tokens are scored before any adaptation, and documents are evaluated independently with no cross-document leakage. The dependency graph is identical to standard autoregressive eval.
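A hedged sketch of that Case-3 loop, assuming a per-document chunk iterator and a `loss_fn` helper; the AdamW lr and epoch count mirror numbers mentioned earlier in the thread, everything else is illustrative:

```python
import copy
import torch

def eval_doc_with_ttt(model, doc_chunks, loss_fn, lr: float = 1e-3, epochs: int = 3):
    """Score one document with test-time training (illustrative loop).

    Starts from a fresh copy of the trained weights, so nothing leaks across
    documents, and scores every chunk *before* taking gradient steps on it,
    so each token's score depends only on earlier tokens.
    """
    m = copy.deepcopy(model)
    opt = torch.optim.AdamW(m.parameters(), lr=lr)
    chunk_losses = []
    for chunk in doc_chunks:                # chunks in document order
        with torch.no_grad():
            chunk_losses.append(loss_fn(m, chunk).item())  # score first, frozen
        for _ in range(epochs):             # then adapt on the already-scored tokens
            opt.zero_grad(set_to_none=True)
            loss_fn(m, chunk).backward()
            opt.step()
    return sum(chunk_losses) / len(chunk_losses)
```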