
Record: XSA-all + Depth Recurrence + Hedge Mixer TTT (val_bpb=1.0278, 3-seed mean)#733

Closed

stukenov wants to merge 2 commits into openai:main from stukenov:submission/v4-hedge-mixer-ttt

Conversation

@stukenov

Record: XSA-all + VRL + CROWN-Q + Depth Recurrence + Hedge Mixer TTT

val_bpb = 1.0278 (3-seed mean, std 0.0039) | ~15.8 MB | 8xH100 SXM, 600s train

3-Seed Results

| Seed | Pre-TTT bpb | Post-TTT bpb | Artifact (bytes) |
|------|-------------|--------------|------------------|
| 1337 | 1.1335      | 1.0235       | 15,827,512       |
| 42   | 1.1346      | 1.0289       | 15,760,352       |
| 2025 | 1.1365      | 1.0311       | 15,713,536       |
| Mean | 1.1349      | 1.0278 (std 0.0039) |           |

Key Innovations (6 additions over PR #549)

  1. XSA on all 11 layers (PR #634): -0.006 BPB
  2. Value Residual Learning (PR #657): -0.002 BPB
  3. Gated Attention (PR #638): -0.002 BPB
  4. CROWN-Q (PR #693): curvature-weighted quantization penalty during warmdown
  5. Depth Recurrence (PR #686): layers 4 and 5 repeated, giving 13 virtual layers from 11 physical
  6. 5-Expert Hedge Mixer (PR #688): GPU-vectorized online context mixing (neural + unigram + bigram + trigram + entropy)
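The Hedge Mixer in item 6 can be sketched as an exponential-weights ("Hedge") mixture over expert next-token distributions. This is a minimal NumPy sketch of the idea, not the PR's GPU-vectorized implementation; the function name, `eta`, and array shapes are illustrative assumptions:

```python
import numpy as np

def hedge_mix(expert_probs, targets, eta=0.1):
    """Online Hedge (exponential weights) mixing of expert next-token
    distributions. expert_probs: (T, K, V) per-expert probabilities for
    T steps, K experts, V-token vocab. targets: (T,) true token ids.
    Returns (mixed_probs, final_weights)."""
    T, K, V = expert_probs.shape
    w = np.full(K, 1.0 / K)                 # uniform prior over experts
    mixed = np.empty((T, V))
    for t in range(T):
        mixed[t] = w @ expert_probs[t]      # mixture BEFORE seeing the target
        # log-loss of each expert on the observed token
        losses = -np.log(expert_probs[t, :, targets[t]] + 1e-12)
        w = w * np.exp(-eta * losses)       # multiplicative-weights update
        w /= w.sum()
    return mixed, w
```

Because each step's mixture is formed before the target is revealed, the mixer is a legal online predictor: better experts (e.g. the neural model vs. a unigram table) accumulate weight as context grows.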

Legal TTT (Score-First)

Every token is scored under torch.inference_mode() before any weight update. The Hedge Mixer n-gram tables are built only from tokens that have already been scored, and TTT uses a plain SGD optimizer (not AdamW).
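The score-first discipline can be sketched as follows. This is a minimal illustration with a placeholder model and chunked loss loop; the function name, chunking, and learning rate are assumptions, not the submitted train_gpt.py:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, targets, lr=2e-3):
    """Score-first test-time training: every chunk is scored under
    torch.inference_mode() BEFORE the weights see it, so the reported
    loss never benefits from adaptation on that same chunk."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # SGD, not AdamW
    losses = []
    for x, y in zip(chunks, targets):
        with torch.inference_mode():                  # 1) score first
            logits = model(x)
            losses.append(F.cross_entropy(logits, y).item())
        opt.zero_grad()                               # 2) then adapt
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return losses
```

The ordering inside the loop is the legality argument: the recorded loss for chunk t depends only on weights updated from chunks < t.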

Note on eval time

TTT eval takes ~755s (exceeds 600s limit). Reducing TTT_EPOCHS from 3 to 1 brings eval under 600s with expected BPB ~1.08-1.09. Happy to resubmit with 1 epoch if required.

Reproduction

```
SEED=1337 torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-25_v4_XSA11_VRL_CROWNQ_DepthRecur_HedgeMixer_TTT/train_gpt.py
```

All defaults in the script match the submitted results. No env vars needed.

Credits

PR #549 (@abaybektursun), PR #634 (@raahilshah), PR #657 (@anthony-maio), PR #638 (@Asukabot0), PR #693 (@EthanYangTW), PR #686 (@msisovic), PR #688 (@RoyiRa), PR #493 (@parinzee), PR #414 (@signalrush)

stukenov and others added 2 commits March 25, 2026 20:22
…(val_bpb=1.0278, 3-seed mean)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ucibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@stukenov
Author

Closing: eval time exceeds 600s limit. Resubmitting with TTT_EPOCHS=1.

@stukenov stukenov closed this Mar 25, 2026
pappanick added a commit to pappanick/parameter-golf that referenced this pull request Mar 26, 2026
- Per-head learned gate in attention (PR openai#638/openai#733): -0.002 BPB
- Lambda_v * x0 shortcut from initial embedding (PR openai#657/openai#733): -0.002 BPB
- Both enabled by default via GATED_ATTENTION=1, VALUE_RESIDUAL=1
- Added attn_gate, lambda_v to control tensor patterns for proper quantization handling
- All smoke tests pass on CPU
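The two tweaks in the commit above can be sketched as a pair of small helpers: a lambda_v-weighted shortcut blending the initial embedding's values into each layer (Value Residual Learning), and a per-head sigmoid gate on attention output. The helper names and shapes are illustrative assumptions, not the actual attn_gate/lambda_v wiring in train_gpt.py:

```python
import torch

def mix_value_residual(v, v0, lambda_v):
    """VRL shortcut: blend a layer's value tensor v with the values v0
    derived from the initial embedding. Shapes: (B, n_heads, T, head_dim);
    lambda_v is a learned scalar."""
    return (1 - lambda_v) * v + lambda_v * v0

def gate_heads(attn_out, gate_logits):
    """Per-head learned gate: sigmoid(gate_logits) scales each head's
    attention output. attn_out: (B, n_heads, T, head_dim);
    gate_logits: (n_heads,) learned parameters."""
    return torch.sigmoid(gate_logits).view(1, -1, 1, 1) * attn_out
```

With gate_logits initialized to zero, each head starts at a neutral 0.5 scale and learns to amplify or suppress itself, which is why both tweaks can be enabled by default at negligible parameter cost.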
pappanick added a commit to pappanick/parameter-golf that referenced this pull request Mar 26, 2026
…eader

Major additions:
- Depth recurrence: layers 4,5 repeated -> 13 virtual from 11 physical
  Repeat blocks share heavy CastedLinear weights, own scalar params
  untie_recurrence() deep-copies before TTT for independent specialization
  Only ~1% param overhead during training
- TTT defaults changed to match PR openai#733 winning recipe:
  - SGD optimizer (was AdamW) - simpler, less memory
  - lr=0.002 (was 0.0005) - higher for SGD
  - Unfreeze all 11 blocks (was 2) - more params for adaptation
- All repeat_blocks params unfrozen for TTT

Configurable via: RECUR_LAYERS="4,5" TTT_OPTIMIZER=sgd TTT_LR=0.002

All smoke tests pass on CPU (syntax, recurrence, weight sharing, untie).
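The depth-recurrence scheme in the commit above can be sketched as a stack where the recurrent blocks run twice, sharing weights until an untie step deep-copies them for independent TTT specialization. Class and method names here are illustrative, not the repo's real code:

```python
import copy
import torch

class RecurrentStack(torch.nn.Module):
    """Depth recurrence sketch: blocks at recur_layers run a second time.
    The repeat pass reuses the same module (shared heavy weights), so
    N physical blocks yield N + len(recur_layers) virtual layers.
    untie() deep-copies the repeated blocks so TTT can specialize each
    pass independently."""
    def __init__(self, blocks, recur_layers=(4, 5)):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)
        self.recur_layers = set(recur_layers)
        self.repeats = torch.nn.ModuleDict()  # filled by untie()

    def forward(self, x):
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in self.recur_layers:
                # shared weights unless untied for TTT
                rep = self.repeats[str(i)] if str(i) in self.repeats else blk
                x = rep(x)
        return x

    def untie(self):
        for i in self.recur_layers:
            self.repeats[str(i)] = copy.deepcopy(self.blocks[i])
```

Before untie() the extra passes add only the repeated blocks' scalar overhead; after untie() the copies start from identical weights (so outputs are unchanged at that instant) and then diverge under TTT updates.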
