Progressive Depth + Hedge Mixer — val_bpb 1.1454 #856
Open
iverbovoy wants to merge 11 commits into openai:main from
Conversation
- Replace 9 unique blocks with 3 blocks × 4 repeats (12 effective layers)
- Increase dim from 512 to 832, remove U-Net skips
- Add loop_embed for timestep encoding per effective layer
- Add cross-repeat skip: each block mixes in its output from the previous repeat with per-repeat learned scales (stateful recurrence)
- Add 2 value-embedding tables mixed into each layer with learned scales
- 17.14M params; best result: 1.6780 bpb (int8+zlib) at 2000 steps, batch 8K
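A minimal pure-Python sketch of the looped-block structure described above. The block count, repeat count, and the shape of the cross-repeat skip follow the commit message; the toy scalar "activations", the affine stand-in blocks, and all names here are illustrative, not from the PR.

```python
# Toy sketch of shared-weight recurrence with a cross-repeat skip:
# 3 shared blocks applied for 4 repeats (12 effective layers). Each block
# mixes in its own output from the previous repeat, weighted by a
# per-(block, repeat) learned scale.

N_BLOCKS, N_REPEATS = 3, 4

# Stand-in "blocks": simple affine maps on a scalar activation.
blocks = [lambda x, b=b: 0.9 * x + 0.1 * (b + 1) for b in range(N_BLOCKS)]

# Per-(block, repeat) learned scales; initialized small here.
repeat_scales = [[0.1] * N_REPEATS for _ in range(N_BLOCKS)]

def forward(x):
    prev_out = [0.0] * N_BLOCKS          # each block's output from the previous repeat
    for r in range(N_REPEATS):
        for b in range(N_BLOCKS):
            y = blocks[b](x)
            if r > 0:                    # cross-repeat skip (stateful recurrence)
                y = y + repeat_scales[b][r] * prev_out[b]
            prev_out[b] = y
            x = y
    return x
```

The recurrence is stateful across repeats but not across sequences; `prev_out` is reset on every forward pass.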
- Add eval_val_ttt: adapts the model on each val batch before evaluating
- For each batch: save weights → K gradient steps → evaluate → restore
- Controlled by TTT_STEPS (default 0 = disabled) and TTT_LR (default 1e-4)
- Result: -0.010 bpb improvement on a 200-step test (2.4124 → 2.4027)
- TTT eval runs after the normal roundtrip eval and reports both scores
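The save → adapt → evaluate → restore loop can be sketched in pure Python. Only the TTT_STEPS/TTT_LR knobs come from the commit; the toy one-parameter least-squares model (and the larger TTT_LR, chosen so the toy visibly adapts) are illustrative.

```python
# Sketch of the test-time-training eval loop: for each val batch, save the
# weights, take TTT_STEPS gradient steps on that batch, evaluate, then restore.
import copy

TTT_STEPS = 3     # PR default is 0 (disabled)
TTT_LR = 1e-1     # PR default is 1e-4; larger here so the toy model converges

def loss(w, batch):
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def grad(w, batch):
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def eval_val_ttt(w, val_batches):
    scores = []
    for batch in val_batches:
        saved = copy.deepcopy(w)              # save weights
        for _ in range(TTT_STEPS):            # K adaptation steps on this batch
            w -= TTT_LR * grad(w, batch)
        scores.append(loss(w, batch))         # evaluate the adapted model
        w = saved                             # restore before the next batch
    return sum(scores) / len(scores), w

batches = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, 3.0)]]
adapted_score, w = eval_val_ttt(0.0, batches)
plain_score = sum(loss(0.0, b) for b in batches) / len(batches)
```

Because weights are restored after every batch, adaptation on one batch never leaks into the next, which keeps the eval honest per-batch.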
- Sliding-window eval: window=1024, stride=256, ~-0.034 bpb
- forward_logits() method for sliding-window support
- LR ×0.3: matrix=0.012, embed=0.015, scalar=0.012 (sweep winner)
- GRAD_CLIP_NORM=0.3 for recurrence stability
- WARMDOWN_ITERS=3000
- train@1024 (not 2048) — better for recurrence (160ms vs 253ms/step)
- Fix grad_accum for non-power-of-2 GPU counts
- Best result: 1.2308 bpb sliding window on 6xH100 (3726 steps)
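The sliding-window indexing can be sketched as follows. The window/stride values come from the commit; the function name and the exact edge handling (sequence length a multiple of the stride) are assumptions for illustration.

```python
# Sketch of sliding-window evaluation: windows of length `window` advance by
# `stride`; in each window only the last `stride` tokens (the ones no earlier
# window has scored) are scored, so every token is scored exactly once with up
# to window-1 tokens of left context.

def sliding_windows(n_tokens, window=1024, stride=256):
    """Return (start, end, score_from) triples over a token sequence."""
    spans = []
    start = 0
    while start + window <= n_tokens:
        end = start + window
        score_from = end - stride if start > 0 else start  # first window scores everything
        spans.append((start, end, score_from))
        start += stride
    return spans

spans = sliding_windows(2048, window=1024, stride=256)
scored = sum(end - sf for _, end, sf in spans)   # total tokens scored
```

Compared with scoring disjoint 1024-token blocks, every non-initial token keeps at least 768 tokens of context, which is where the ~-0.034 bpb comes from.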
- Fix quantization clamp_min(1/ql) → clamp_min(1e-12), preventing a broken roundtrip on undertrained models
- Add Muon weight decay (0.04) for training stability
- Add SWA with float32 accumulation and inclusion of the final snapshot
- Remove sweep.sh
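The clamp fix can be illustrated with symmetric per-row int8-style quantization. Clamping the scale at 1/ql forces a minimum quantization step of 1/ql, so a row whose weights are all much smaller than that (as in an undertrained model) rounds entirely to zero; clamping at 1e-12 only guards against division by zero. The two clamp values are from the commit; the function names and the exact quantizer layout are assumptions.

```python
# Sketch of the clamp fix in symmetric per-row quantization to ql levels.
def quantize(row, ql=127, eps=1e-12):
    scale = max(max(abs(w) for w in row) / ql, eps)   # old code: eps = 1/ql
    return [round(w / scale) for w in row], scale

def dequantize(q, scale):
    return [v * scale for v in q]

tiny_row = [1e-4, -5e-5]                          # undertrained, near-zero weights
q_new, s_new = quantize(tiny_row)                 # eps = 1e-12 (the fix)
q_old, s_old = quantize(tiny_row, eps=1 / 127)    # eps = 1/ql (the bug)
```

With the old clamp, `q_old` is all zeros and the roundtrip loses the row entirely; with the fix, the scale adapts to the row's actual magnitude.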
Improvements over the previous submission (1.2196 → 1.2070, -0.013 bpb):
- XSA (Exclusive Self-Attention) on the last 4 effective layers: -0.010 bpb
- LeakyReLU(0.5)² instead of relu²: -0.004 bpb
- GPTQ-lite: per-row best-of-5 clip percentiles for quantization
- zstd-22 compression instead of zlib (saves ~1.85MB of artifact)
- SWA tuned to frac=0.4, every=50

Tested on 8xH100, 80 train shards, PyTorch 2.5, 4290 steps.
Improvements over the previous submission (1.2196 → 1.2065, -0.013 bpb):
- XSA (Exclusive Self-Attention) on the last 4 effective layers: -0.010 bpb
- LeakyReLU(0.5)² instead of relu²: -0.004 bpb
- GPTQ-lite: per-row best-of-5 clip percentiles
- zstd-22 compression instead of zlib
- SWA tuned to frac=0.4, every=50

8xH100, 80 train shards, 4300 steps, 140ms/step, 15.87MB artifact.
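The GPTQ-lite idea above can be sketched as a per-row search: try a handful of clip thresholds, quantize under each, and keep whichever minimizes reconstruction error. The PR only says "per-row best-of-5 clip percentiles"; the candidate fractions, function names, and symmetric-int8 quantizer below are illustrative.

```python
# Per-row best-of-5 clip search: clip the row at frac * absmax, quantize
# symmetrically to ql levels, and keep the frac with the lowest MSE.
CLIP_FRACS = [1.0, 0.995, 0.99, 0.97, 0.95]   # illustrative candidates

def quant_roundtrip(row, clip, ql=127):
    scale = max(clip / ql, 1e-12)
    return [max(-ql, min(ql, round(w / scale))) * scale for w in row]

def best_clip(row, ql=127):
    absmax = max(abs(w) for w in row)
    best = None
    for frac in CLIP_FRACS:
        rt = quant_roundtrip(row, frac * absmax, ql)
        mse = sum((a - b) ** 2 for a, b in zip(rt, row)) / len(row)
        if best is None or mse < best[0]:
            best = (mse, frac)
    return best[1]
```

Clipping below the absmax trades a little error on the largest weights for a finer step size on the bulk of the row, which is why a sub-1.0 fraction can win on heavy-tailed rows.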
Dynamic depth scheduling, unique to shared-weight recurrence:
- Phase 1 (0-40%): 2 repeats, ~75ms/step — fast base training
- Phase 2 (40-65%): 3 repeats, ~83ms/step — intermediate depth
- Phase 3 (65-100%): 4 repeats, ~100ms/step — full recurrence

5981 steps vs 4300 without progressive depth (+39%). SWA is collected only at full depth (the last phase) to avoid mixing phases. Removed unused TTT eval code. 8xH100, 80 train shards, sliding 1.1973 (-0.009 vs previous 1.2065).
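The three-phase schedule above maps training progress to a repeat count. The 40% / 65% boundaries and repeat counts are from the PR description; the function name is illustrative.

```python
# Progressive depth schedule: repeats of the shared blocks by training progress.
def repeats_at(step, total_steps):
    frac = step / total_steps
    if frac < 0.40:
        return 2    # phase 1: fast base training (~75ms/step)
    if frac < 0.65:
        return 3    # phase 2: intermediate depth (~83ms/step)
    return 4        # phase 3: full recurrence (~100ms/step)
```

Shallow early phases are what buy the extra steps in the fixed time budget; the average cost per step is well below the full-depth ~100ms.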
Progressive depth scheduling (2→3→4 repeats), unique to shared-weight recurrence. 5861 steps in 600s vs ~4300 at constant depth (+36%). Fix a DDP race condition in phase switching via an all_reduce sync.
Systematic tuning on 8xH100 (6 runs):
- WARMDOWN_ITERS 3000→2000: full LR at phase-4 entry (-0.0009)
- MATRIX/SCALAR_LR 0.012→0.018: higher LR for progressive depth (-0.0011)
- Combined: val_bpb 1.1960 sliding (-0.0020 from 1.1980)

Tested and rejected: schedule changes (3-phase is optimal), SWA_EVERY=25, 5 repeats, GRAD_CLIP=0.5, VRL, per-repeat LoRA (artifact >16MB).
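The WARMDOWN_ITERS change can be sketched as a schedule. Only the 0.018 base LR and the 2000-iteration warmdown come from the tuning notes; the linear-decay-to-zero shape is an assumption (a common choice in this codebase family, but not spelled out in the PR).

```python
# Assumed linear LR warmdown over the final WARMDOWN_ITERS steps.
def lr_at(step, total_steps, base_lr=0.018, warmdown_iters=2000):
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return base_lr                       # full LR until the warmdown window
    return base_lr * remaining / warmdown_iters
```

Shrinking the warmdown from 3000 to 2000 means the model is still at full LR when the final full-depth phase begins, which is the stated rationale for the -0.0009 gain.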
5-expert online ensemble (neural + unigram + bigram + trigram + entropy) via the Hedge algorithm at eval time: -0.051 bpb over sliding window. Tuned defaults: LR=0.018, WARMDOWN=2000 (-0.002 from previous). Total improvement: 1.2244 → 1.1454 (-0.079 from baseline).
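The Hedge (multiplicative weights) mixture can be sketched in pure Python: predict with the weight-averaged distribution, then multiply each expert's weight by exp(-eta × its log-loss) on the observed token. The two toy experts, the 2-token vocabulary, and the eta value below are illustrative stand-ins for the 5-expert ensemble; the PR does not specify its Hedge learning rate.

```python
# Online Hedge mixture over expert next-token distributions.
import math

def hedge_mix(expert_preds, targets, eta=0.5):
    """expert_preds[t][e] is expert e's probability dist. over tokens at step t."""
    n_exp = len(expert_preds[0])
    w = [1.0 / n_exp] * n_exp
    total_loss = 0.0
    for dists, y in zip(expert_preds, targets):
        mix = [sum(w[e] * dists[e][v] for e in range(n_exp))
               for v in range(len(dists[0]))]
        total_loss += -math.log(max(mix[y], 1e-12))      # log-loss of the mixture
        # multiplicative-weights update on each expert's own log-loss
        losses = [-math.log(max(dists[e][y], 1e-12)) for e in range(n_exp)]
        w = [w[e] * math.exp(-eta * losses[e]) for e in range(n_exp)]
        z = sum(w)
        w = [x / z for x in w]
    return total_loss / len(targets), w

# Two toy experts over a 2-token vocab; expert 0 is consistently right.
good, bad = [0.9, 0.1], [0.1, 0.9]
preds = [[good, bad]] * 20
targets = [0] * 20
avg_loss, w = hedge_mix(preds, targets)
```

Because the update is per-token at eval time, weak experts like the n-gram models only dominate on the rare positions where they actually beat the neural model, which is how the ensemble recovers bpb without retraining.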
Summary
Results
Training: 600s, 5673 steps on 8xH100. Eval (Hedge): 579s on 8xH100. Artifact: 15.88MB.
Ablation Trajectory
Test plan