Progressive Depth + Hedge Mixer — val_bpb 1.1454 #856
Open
iverbovoy wants to merge 11 commits into openai:main from
Conversation
- Replace 9 unique blocks with 3 blocks × 4 repeats (12 effective layers)
- Increase dim from 512 to 832, remove U-Net skips
- Add loop_embed for timestep encoding per effective layer
- Add cross-repeat skip: each block mixes in its output from the previous repeat with per-repeat learned scales (stateful recurrence)
- Add 2 value-embedding tables mixed into each layer with learned scales
- 17.14M params; best result: 1.6780 bpb (int8+zlib) at 2000 steps, batch 8K
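A minimal pure-Python sketch of the looped-block structure described above. The block count, repeat count, and the shape of the cross-repeat skip follow the commit message; the toy scalar "activations", the affine stand-in blocks, and all names here are illustrative, not from the PR.

```python
# Toy sketch of shared-weight recurrence with a cross-repeat skip:
# 3 shared blocks applied for 4 repeats (12 effective layers). Each block
# mixes in its own output from the previous repeat, weighted by a
# per-(block, repeat) learned scale.

N_BLOCKS, N_REPEATS = 3, 4

# Stand-in "blocks": simple affine maps on a scalar activation.
blocks = [lambda x, b=b: 0.9 * x + 0.1 * (b + 1) for b in range(N_BLOCKS)]

# Per-(block, repeat) learned scales; initialized small here.
repeat_scales = [[0.1] * N_REPEATS for _ in range(N_BLOCKS)]

def forward(x):
    prev_out = [0.0] * N_BLOCKS          # each block's output from the previous repeat
    for r in range(N_REPEATS):
        for b in range(N_BLOCKS):
            y = blocks[b](x)
            if r > 0:                    # cross-repeat skip (stateful recurrence)
                y = y + repeat_scales[b][r] * prev_out[b]
            prev_out[b] = y
            x = y
    return x
```

The recurrence is stateful across repeats but not across sequences; `prev_out` is reset on every forward pass.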
- Add eval_val_ttt: adapts the model on each val batch before evaluating
- For each batch: save weights → K gradient steps → evaluate → restore
- Controlled by TTT_STEPS (default 0 = disabled) and TTT_LR (default 1e-4)
- Result: -0.010 bpb improvement on a 200-step test (2.4124 → 2.4027)
- TTT eval runs after the normal roundtrip eval and reports both scores
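The save → adapt → evaluate → restore loop can be sketched in pure Python. Only the TTT_STEPS/TTT_LR knobs come from the commit; the toy one-parameter least-squares model (and the larger TTT_LR, chosen so the toy visibly adapts) are illustrative.

```python
# Sketch of the test-time-training eval loop: for each val batch, save the
# weights, take TTT_STEPS gradient steps on that batch, evaluate, then restore.
import copy

TTT_STEPS = 3     # PR default is 0 (disabled)
TTT_LR = 1e-1     # PR default is 1e-4; larger here so the toy model converges

def loss(w, batch):
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def grad(w, batch):
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def eval_val_ttt(w, val_batches):
    scores = []
    for batch in val_batches:
        saved = copy.deepcopy(w)              # save weights
        for _ in range(TTT_STEPS):            # K adaptation steps on this batch
            w -= TTT_LR * grad(w, batch)
        scores.append(loss(w, batch))         # evaluate the adapted model
        w = saved                             # restore before the next batch
    return sum(scores) / len(scores), w

batches = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, 3.0)]]
adapted_score, w = eval_val_ttt(0.0, batches)
plain_score = sum(loss(0.0, b) for b in batches) / len(batches)
```

Because weights are restored after every batch, adaptation on one batch never leaks into the next, which keeps the eval honest per-batch.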
- Sliding-window eval: window=1024, stride=256, ~-0.034 bpb
- forward_logits() method for sliding-window support
- LR ×0.3: matrix=0.012, embed=0.015, scalar=0.012 (sweep winner)
- GRAD_CLIP_NORM=0.3 for recurrence stability
- WARMDOWN_ITERS=3000
- train@1024 (not 2048) — better for recurrence (160ms vs 253ms/step)
- Fix grad_accum for non-power-of-2 GPU counts
- Best result: 1.2308 bpb sliding window on 6xH100 (3726 steps)
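The sliding-window indexing can be sketched as follows. The window/stride values come from the commit; the function name and the exact edge handling (sequence length a multiple of the stride) are assumptions for illustration.

```python
# Sketch of sliding-window evaluation: windows of length `window` advance by
# `stride`; in each window only the last `stride` tokens (the ones no earlier
# window has scored) are scored, so every token is scored exactly once with up
# to window-1 tokens of left context.

def sliding_windows(n_tokens, window=1024, stride=256):
    """Return (start, end, score_from) triples over a token sequence."""
    spans = []
    start = 0
    while start + window <= n_tokens:
        end = start + window
        score_from = end - stride if start > 0 else start  # first window scores everything
        spans.append((start, end, score_from))
        start += stride
    return spans

spans = sliding_windows(2048, window=1024, stride=256)
scored = sum(end - sf for _, end, sf in spans)   # total tokens scored
```

Compared with scoring disjoint 1024-token blocks, every non-initial token keeps at least 768 tokens of context, which is where the ~-0.034 bpb comes from.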
- Fix quantization clamp_min(1/ql) → clamp_min(1e-12), preventing a broken roundtrip on undertrained models
- Add Muon weight decay (0.04) for training stability
- Add SWA with float32 accumulation and inclusion of the final snapshot
- Remove sweep.sh
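The clamp fix can be illustrated with symmetric per-row int8-style quantization. Clamping the scale at 1/ql forces a minimum quantization step of 1/ql, so a row whose weights are all much smaller than that (as in an undertrained model) rounds entirely to zero; clamping at 1e-12 only guards against division by zero. The two clamp values are from the commit; the function names and the exact quantizer layout are assumptions.

```python
# Sketch of the clamp fix in symmetric per-row quantization to ql levels.
def quantize(row, ql=127, eps=1e-12):
    scale = max(max(abs(w) for w in row) / ql, eps)   # old code: eps = 1/ql
    return [round(w / scale) for w in row], scale

def dequantize(q, scale):
    return [v * scale for v in q]

tiny_row = [1e-4, -5e-5]                          # undertrained, near-zero weights
q_new, s_new = quantize(tiny_row)                 # eps = 1e-12 (the fix)
q_old, s_old = quantize(tiny_row, eps=1 / 127)    # eps = 1/ql (the bug)
```

With the old clamp, `q_old` is all zeros and the roundtrip loses the row entirely; with the fix, the scale adapts to the row's actual magnitude.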
Improvements over the previous submission (1.2196 → 1.2070, -0.013 bpb):
- XSA (Exclusive Self-Attention) on the last 4 effective layers: -0.010 bpb
- LeakyReLU(0.5)² instead of relu²: -0.004 bpb
- GPTQ-lite: per-row best-of-5 clip percentiles for quantization
- zstd-22 compression instead of zlib (saves ~1.85MB of artifact)
- SWA tuned to frac=0.4, every=50

Tested on 8xH100, 80 train shards, PyTorch 2.5, 4290 steps.
Improvements over the previous submission (1.2196 → 1.2065, -0.013 bpb):
- XSA (Exclusive Self-Attention) on the last 4 effective layers: -0.010 bpb
- LeakyReLU(0.5)² instead of relu²: -0.004 bpb
- GPTQ-lite: per-row best-of-5 clip percentiles
- zstd-22 compression instead of zlib
- SWA tuned to frac=0.4, every=50

8xH100, 80 train shards, 4300 steps, 140ms/step, 15.87MB artifact.
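The GPTQ-lite idea above can be sketched as a per-row search: try a handful of clip thresholds, quantize under each, and keep whichever minimizes reconstruction error. The PR only says "per-row best-of-5 clip percentiles"; the candidate fractions, function names, and symmetric-int8 quantizer below are illustrative.

```python
# Per-row best-of-5 clip search: clip the row at frac * absmax, quantize
# symmetrically to ql levels, and keep the frac with the lowest MSE.
CLIP_FRACS = [1.0, 0.995, 0.99, 0.97, 0.95]   # illustrative candidates

def quant_roundtrip(row, clip, ql=127):
    scale = max(clip / ql, 1e-12)
    return [max(-ql, min(ql, round(w / scale))) * scale for w in row]

def best_clip(row, ql=127):
    absmax = max(abs(w) for w in row)
    best = None
    for frac in CLIP_FRACS:
        rt = quant_roundtrip(row, frac * absmax, ql)
        mse = sum((a - b) ** 2 for a, b in zip(rt, row)) / len(row)
        if best is None or mse < best[0]:
            best = (mse, frac)
    return best[1]
```

Clipping below the absmax trades a little error on the largest weights for a finer step size on the bulk of the row, which is why a sub-1.0 fraction can win on heavy-tailed rows.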
Dynamic depth scheduling, unique to shared-weight recurrence:
- Phase 1 (0-40%): 2 repeats, ~75ms/step — fast base training
- Phase 2 (40-65%): 3 repeats, ~83ms/step — intermediate depth
- Phase 3 (65-100%): 4 repeats, ~100ms/step — full recurrence

5981 steps vs 4300 without progressive depth (+39%). SWA is collected only at full depth (the last phase) to avoid mixing phases. Removed unused TTT eval code. 8xH100, 80 train shards, sliding 1.1973 (-0.009 vs previous 1.2065).
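The three-phase schedule above maps training progress to a repeat count. The 40% / 65% boundaries and repeat counts are from the PR description; the function name is illustrative.

```python
# Progressive depth schedule: repeats of the shared blocks by training progress.
def repeats_at(step, total_steps):
    frac = step / total_steps
    if frac < 0.40:
        return 2    # phase 1: fast base training (~75ms/step)
    if frac < 0.65:
        return 3    # phase 2: intermediate depth (~83ms/step)
    return 4        # phase 3: full recurrence (~100ms/step)
```

Shallow early phases are what buy the extra steps in the fixed time budget; the average cost per step is well below the full-depth ~100ms.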
Progressive depth scheduling (2→3→4 repeats), unique to shared-weight recurrence. 5861 steps in 600s vs ~4300 at constant depth (+36%). Fix a DDP race condition in phase switching via an all_reduce sync.
Systematic tuning on 8xH100 (6 runs):
- WARMDOWN_ITERS 3000→2000: full LR at phase-4 entry (-0.0009)
- MATRIX/SCALAR_LR 0.012→0.018: higher LR for progressive depth (-0.0011)
- Combined: val_bpb 1.1960 sliding (-0.0020 from 1.1980)

Tested and rejected: schedule changes (3-phase is optimal), SWA_EVERY=25, 5 repeats, GRAD_CLIP=0.5, VRL, per-repeat LoRA (artifact >16MB).
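The WARMDOWN_ITERS change can be sketched as a schedule. Only the 0.018 base LR and the 2000-iteration warmdown come from the tuning notes; the linear-decay-to-zero shape is an assumption (a common choice in this codebase family, but not spelled out in the PR).

```python
# Assumed linear LR warmdown over the final WARMDOWN_ITERS steps.
def lr_at(step, total_steps, base_lr=0.018, warmdown_iters=2000):
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return base_lr                       # full LR until the warmdown window
    return base_lr * remaining / warmdown_iters
```

Shrinking the warmdown from 3000 to 2000 means the model is still at full LR when the final full-depth phase begins, which is the stated rationale for the -0.0009 gain.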
5-expert online ensemble (neural + unigram + bigram + trigram + entropy) via the Hedge algorithm at eval time: -0.051 bpb over sliding window. Tuned defaults: LR=0.018, WARMDOWN=2000 (-0.002 from previous). Total improvement: 1.2244 → 1.1454 (-0.079 from baseline).
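The Hedge (multiplicative weights) mixture can be sketched in pure Python: predict with the weight-averaged distribution, then multiply each expert's weight by exp(-eta × its log-loss) on the observed token. The two toy experts, the 2-token vocabulary, and the eta value below are illustrative stand-ins for the 5-expert ensemble; the PR does not specify its Hedge learning rate.

```python
# Online Hedge mixture over expert next-token distributions.
import math

def hedge_mix(expert_preds, targets, eta=0.5):
    """expert_preds[t][e] is expert e's probability dist. over tokens at step t."""
    n_exp = len(expert_preds[0])
    w = [1.0 / n_exp] * n_exp
    total_loss = 0.0
    for dists, y in zip(expert_preds, targets):
        mix = [sum(w[e] * dists[e][v] for e in range(n_exp))
               for v in range(len(dists[0]))]
        total_loss += -math.log(max(mix[y], 1e-12))      # log-loss of the mixture
        # multiplicative-weights update on each expert's own log-loss
        losses = [-math.log(max(dists[e][y], 1e-12)) for e in range(n_exp)]
        w = [w[e] * math.exp(-eta * losses[e]) for e in range(n_exp)]
        z = sum(w)
        w = [x / z for x in w]
    return total_loss / len(targets), w

# Two toy experts over a 2-token vocab; expert 0 is consistently right.
good, bad = [0.9, 0.1], [0.1, 0.9]
preds = [[good, bad]] * 20
targets = [0] * 20
avg_loss, w = hedge_mix(preds, targets)
```

Because the update is per-token at eval time, weak experts like the n-gram models only dominate on the rare positions where they actually beat the neural model, which is how the ensemble recovers bpb without retraining.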
Summary
Results
Training: 600s, 5673 steps on 8xH100. Eval (Hedge): 579s on 8xH100. Artifact: 15.88MB.
Ablation Trajectory
Test plan