
Record: XSA-all + Depth Recurrence + Hedge Mixer TTT (val_bpb=1.0278, 3-seed mean)#733

Closed

stukenov wants to merge 2 commits into openai:main from stukenov:submission/v4-hedge-mixer-ttt

Conversation

@stukenov

Record: XSA-all + VRL + CROWN-Q + Depth Recurrence + Hedge Mixer TTT

val_bpb = 1.0278 (3-seed mean, std 0.0039) | ~15.8 MB | 8xH100 SXM, 600s train

3-Seed Results

| Seed | Pre-TTT bpb | Post-TTT bpb | Artifact (bytes) |
|------|-------------|--------------|------------------|
| 1337 | 1.1335      | 1.0235       | 15,827,512       |
| 42   | 1.1346      | 1.0289       | 15,760,352       |
| 2025 | 1.1365      | 1.0311       | 15,713,536       |
| Mean | 1.1349      | 1.0278 (std 0.0039) |           |

Key Innovations (6 additions over PR #549)

  1. XSA on all 11 layers (PR #634): -0.006 BPB
  2. Value Residual Learning (PR #657): -0.002 BPB
  3. Gated Attention (PR #638): -0.002 BPB
  4. CROWN-Q (PR #693): curvature-weighted quantization penalty during warmdown
  5. Depth Recurrence (PR #686): layers 4 and 5 repeated, giving 13 virtual layers from 11 physical
  6. 5-Expert Hedge Mixer (PR #688): GPU-vectorized online context mixing (neural + unigram + bigram + trigram + entropy)
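The Hedge Mixer in item 6 can be sketched as an exponential-weights ("Hedge") mixture over expert next-token distributions. This is a minimal NumPy sketch of the idea, not the PR's GPU-vectorized implementation; the function name, `eta`, and array shapes are illustrative assumptions:

```python
import numpy as np

def hedge_mix(expert_probs, targets, eta=0.1):
    """Online Hedge (exponential weights) mixing of expert next-token
    distributions. expert_probs: (T, K, V) per-expert probabilities for
    T steps, K experts, V-token vocab. targets: (T,) true token ids.
    Returns (mixed_probs, final_weights)."""
    T, K, V = expert_probs.shape
    w = np.full(K, 1.0 / K)                 # uniform prior over experts
    mixed = np.empty((T, V))
    for t in range(T):
        mixed[t] = w @ expert_probs[t]      # mixture BEFORE seeing the target
        # log-loss of each expert on the observed token
        losses = -np.log(expert_probs[t, :, targets[t]] + 1e-12)
        w = w * np.exp(-eta * losses)       # multiplicative-weights update
        w /= w.sum()
    return mixed, w
```

Because each step's mixture is formed before the target is revealed, the mixer is a legal online predictor: better experts (e.g. the neural model vs. a unigram table) accumulate weight as context grows.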

Legal TTT (Score-First)

Every token is scored under torch.inference_mode() before any weight update. The Hedge Mixer n-gram tables are built only from tokens that have already been scored, and TTT uses a plain SGD optimizer (not AdamW).
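The score-first discipline can be sketched as follows. This is a minimal illustration with a placeholder model and chunked loss loop; the function name, chunking, and learning rate are assumptions, not the submitted train_gpt.py:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, targets, lr=2e-3):
    """Score-first test-time training: every chunk is scored under
    torch.inference_mode() BEFORE the weights see it, so the reported
    loss never benefits from adaptation on that same chunk."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # SGD, not AdamW
    losses = []
    for x, y in zip(chunks, targets):
        with torch.inference_mode():                  # 1) score first
            logits = model(x)
            losses.append(F.cross_entropy(logits, y).item())
        opt.zero_grad()                               # 2) then adapt
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return losses
```

The ordering inside the loop is the legality argument: the recorded loss for chunk t depends only on weights updated from chunks < t.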

Note on eval time

TTT eval takes ~755s (exceeds 600s limit). Reducing TTT_EPOCHS from 3 to 1 brings eval under 600s with expected BPB ~1.08-1.09. Happy to resubmit with 1 epoch if required.

Reproduction

```
SEED=1337 torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-25_v4_XSA11_VRL_CROWNQ_DepthRecur_HedgeMixer_TTT/train_gpt.py
```

All defaults in the script match the submitted results. No env vars needed.

Credits

PR #549 (@abaybektursun), PR #634 (@raahilshah), PR #657 (@anthony-maio), PR #638 (@Asukabot0), PR #693 (@EthanYangTW), PR #686 (@msisovic), PR #688 (@RoyiRa), PR #493 (@parinzee), PR #414 (@signalrush)

stukenov and others added 2 commits March 25, 2026 20:22
…(val_bpb=1.0278, 3-seed mean)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ucibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@stukenov
Author

Closing: eval time exceeds 600s limit. Resubmitting with TTT_EPOCHS=1.

@stukenov stukenov closed this Mar 25, 2026
pappanick added a commit to pappanick/parameter-golf that referenced this pull request Mar 26, 2026
- Per-head learned gate in attention (PR openai#638/openai#733): -0.002 BPB
- Lambda_v * x0 shortcut from initial embedding (PR openai#657/openai#733): -0.002 BPB
- Both enabled by default via GATED_ATTENTION=1, VALUE_RESIDUAL=1
- Added attn_gate, lambda_v to control tensor patterns for proper quantization handling
- All smoke tests pass on CPU
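The two tweaks in the commit above can be sketched as a pair of small helpers: a lambda_v-weighted shortcut blending the initial embedding's values into each layer (Value Residual Learning), and a per-head sigmoid gate on attention output. The helper names and shapes are illustrative assumptions, not the actual attn_gate/lambda_v wiring in train_gpt.py:

```python
import torch

def mix_value_residual(v, v0, lambda_v):
    """VRL shortcut: blend a layer's value tensor v with the values v0
    derived from the initial embedding. Shapes: (B, n_heads, T, head_dim);
    lambda_v is a learned scalar."""
    return (1 - lambda_v) * v + lambda_v * v0

def gate_heads(attn_out, gate_logits):
    """Per-head learned gate: sigmoid(gate_logits) scales each head's
    attention output. attn_out: (B, n_heads, T, head_dim);
    gate_logits: (n_heads,) learned parameters."""
    return torch.sigmoid(gate_logits).view(1, -1, 1, 1) * attn_out
```

With gate_logits initialized to zero, each head starts at a neutral 0.5 scale and learns to amplify or suppress itself, which is why both tweaks can be enabled by default at negligible parameter cost.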
pappanick added a commit to pappanick/parameter-golf that referenced this pull request Mar 26, 2026
…eader

Major additions:
- Depth recurrence: layers 4,5 repeated -> 13 virtual from 11 physical
  Repeat blocks share heavy CastedLinear weights, own scalar params
  untie_recurrence() deep-copies before TTT for independent specialization
  Only ~1% param overhead during training
- TTT defaults changed to match PR openai#733 winning recipe:
  - SGD optimizer (was AdamW) - simpler, less memory
  - lr=0.002 (was 0.0005) - higher for SGD
  - Unfreeze all 11 blocks (was 2) - more params for adaptation
- All repeat_blocks params unfrozen for TTT

Configurable via: RECUR_LAYERS="4,5" TTT_OPTIMIZER=sgd TTT_LR=0.002

All smoke tests pass on CPU (syntax, recurrence, weight sharing, untie).
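The depth-recurrence scheme in the commit above can be sketched as a stack where the recurrent blocks run twice, sharing weights until an untie step deep-copies them for independent TTT specialization. Class and method names here are illustrative, not the repo's real code:

```python
import copy
import torch

class RecurrentStack(torch.nn.Module):
    """Depth recurrence sketch: blocks at recur_layers run a second time.
    The repeat pass reuses the same module (shared heavy weights), so
    N physical blocks yield N + len(recur_layers) virtual layers.
    untie() deep-copies the repeated blocks so TTT can specialize each
    pass independently."""
    def __init__(self, blocks, recur_layers=(4, 5)):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)
        self.recur_layers = set(recur_layers)
        self.repeats = torch.nn.ModuleDict()  # filled by untie()

    def forward(self, x):
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in self.recur_layers:
                # shared weights unless untied for TTT
                rep = self.repeats[str(i)] if str(i) in self.repeats else blk
                x = rep(x)
        return x

    def untie(self):
        for i in self.recur_layers:
            self.repeats[str(i)] = copy.deepcopy(self.blocks[i])
```

Before untie() the extra passes add only the repeated blocks' scalar overhead; after untie() the copies start from identical weights (so outputs are unchanged at that instant) and then diverge under TTT updates.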
