# Record: XSA-all + VRL + CROWN-Q + Depth Recurrence + Hedge Mixer TTT

**val_bpb = 1.0222** (3-seed mean, std 0.0067) | **<16 MB** | 8xH100 SXM | 600s train, 507s eval

## 3-Seed Results (8xH100 80GB SXM, PyTorch 2.4.0+cu124)

| Seed | Steps | step_avg | Pre-TTT bpb | **Post-TTT bpb** | TTT time | Artifact (bytes) |
|------|-------|----------|-------------|------------------|----------|-------------------|
| 1337 | 4,473 | 134.2ms | 1.1336 | **1.0201** | 507s | 15,857,972 |
| 42 | 4,452 | 134.8ms | 1.1339 | **1.0165** | 508s | 15,846,228 |
| 2025 | 4,451 | 134.8ms | 1.1369 | **1.0299** | 507s | 15,669,888 |
| **Mean** | | | **1.1348** | **1.0222 (std 0.0067)** | **507s** | |

All artifacts are under 16,000,000 bytes. Training: 600s. Eval (TTT + sliding-window): 507s. Both are within the limits.

## Architecture: PR #549 base + 6 additions

### 1. XSA on all layers (PR #634)
Exclusive Self-Attention on all 13 virtual layers (11 physical + 2 recurred). -0.006 bpb vs. XSA on only the last 4 layers.

### 2. Value Residual Learning (PR #657, arXiv:2410.17897)
Layer 0's V output is blended into the attention of subsequent layers via learned sigmoid gates. +10 params.
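
A minimal sketch of the value-residual blend, assuming one learned scalar gate per downstream layer (consistent with the "+10 params" above); the class and attribute names are illustrative, not the actual train_gpt.py identifiers.

```python
import torch
import torch.nn as nn

class ValueResidualGate(nn.Module):
    """Blend the current layer's V with layer 0's V via a learned sigmoid gate (sketch)."""
    def __init__(self, init: float = 2.0):
        super().__init__()
        # sigmoid(2.0) ~ 0.88: start mostly on the current layer's own values
        self.gate = nn.Parameter(torch.tensor(init))

    def forward(self, v: torch.Tensor, v_first: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.gate)
        return a * v + (1.0 - a) * v_first  # value residual from layer 0
```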

### 3. Gated Attention (PR #638)
Per-head sigmoid gates on attention output.
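
A sketch of per-head output gating, assuming the gate logits are projected from the layer input; the exact gate parameterization in the submission may differ.

```python
import torch
import torch.nn as nn

class HeadGate(nn.Module):
    """Per-head sigmoid gate on the attention output (sketch, illustrative names)."""
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.proj = nn.Linear(dim, n_heads)  # one gate logit per head

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x, attn_out: (B, T, dim); scale each head's slice of the output
        B, T, dim = attn_out.shape
        g = torch.sigmoid(self.proj(x))                              # (B, T, n_heads)
        out = attn_out.view(B, T, self.n_heads, dim // self.n_heads)
        return (out * g.unsqueeze(-1)).view(B, T, dim)
```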

### 4. CROWN-Q (PR #693)
Curvature-weighted quantization penalty during warmdown: `lambda * mean(w^2) * (row_max/15)^2 / 12`. Pushes weights into flat minima for better int6 quantization. Zero eval cost.
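
A sketch of the warmdown penalty exactly as written above; whether `row_max` is detached and whether the term is reduced per row or per tensor are assumptions here, as is the default `lam`.

```python
import torch

def crownq_penalty(weights, lam: float = 1e-3) -> torch.Tensor:
    """CROWN-Q-style penalty (sketch): lambda * mean(w^2) * (row_max/15)^2 / 12.

    (row_max/15)^2 / 12 is the uniform-quantization noise variance for a
    per-row step of row_max/15; mean(w^2) is the curvature proxy per the record.
    """
    total = weights[0].new_zeros(())
    for w in weights:
        w2 = w.reshape(w.shape[0], -1)
        row_max = w2.abs().amax(dim=1).detach()       # per-row absolute max
        noise_var = (row_max / 15.0) ** 2 / 12.0      # quantization noise variance
        total = total + (w2.pow(2).mean(dim=1) * noise_var).mean()
    return lam * total
```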

### 5. Depth Recurrence (PR #686)
Layers 4 and 5 are re-executed, so 11 physical layers become 13 virtual layers (pattern: 0,1,2,3,4,5,4,5,6,7,8,9,10). Per-layer banks are indexed via the virtual-to-physical (v2p) mapping. Weights are untied before TTT.
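
A sketch of how the v2p map can drive the forward pass; the layer pattern is copied from the record, while the function signature and the per-virtual-layer state handling are illustrative assumptions.

```python
# Virtual-to-physical layer map from the record: layers 4 and 5 run twice.
V2P = [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]  # 13 virtual -> 11 physical

def forward_recurrent(x, blocks, per_virtual_state):
    """Run 11 physical blocks as 13 virtual layers (sketch).

    `blocks` holds the 11 physical transformer blocks; `per_virtual_state`
    holds any per-virtual-layer banks (e.g. gates), indexed by virtual position
    so the second pass through layers 4/5 gets its own state.
    """
    for v_idx, p_idx in enumerate(V2P):
        x = blocks[p_idx](x, state=per_virtual_state[v_idx])
    return x
```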

### 6. 5-Expert Hedge Mixer (PR #688)
GPU-vectorized online context mixing during TTT eval:

| Expert | Source |
|--------|--------|
| Neural | Base model log-softmax |
| Unigram | Token frequency from scored tokens |
| Bigram | P(next \| prev) from scored tokens |
| Trigram | Hashed 64K-bucket trigram table |
| Entropy | Neural entropy as confidence weight |
Mixture weights are updated online with the Hedge algorithm. All n-gram tables are built from already-scored tokens only.
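
A minimal Hedge (exponential-weights) update over the five expert streams; only the expert set comes from the record, while the learning rate and the per-token log-loss as the update signal are assumptions.

```python
import torch

def hedge_step(weights, expert_logprobs, target, eta: float = 0.1):
    """One Hedge update over expert predictive distributions (sketch).

    weights:         (n_experts,) non-negative mixture weights summing to 1
    expert_logprobs: (n_experts, vocab) per-expert log-probabilities for this token
    target:          int, the token actually observed
    """
    # Mixture prediction used for scoring, formed before seeing the target.
    mix_logprob = torch.logsumexp(expert_logprobs + weights.log().unsqueeze(1), dim=0)
    # Per-expert log-loss on the observed token drives the multiplicative update.
    losses = -expert_logprobs[:, target]
    weights = weights * torch.exp(-eta * losses)
    weights = weights / weights.sum()
    return mix_logprob, weights
```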

### Training Stack

11L physical (13 virtual), 512d, 8H/4KV GQA, MLP 3x LeakyReLU(0.5)^2, SmearGate, BigramHash(2048), VE128, EMA(0.997) + SWA, GPTQ-lite int6 + lzma, Muon WD=0.04, warmdown=3500.

### Legal Score-First TTT (1 epoch)

```
for each 32K-token chunk:
  Phase 1: SCORE under torch.inference_mode() + Hedge Mixer scoring
  Phase 2: UPDATE mixer n-gram tables with scored tokens
  Phase 3: TRAIN SGD(lr=0.002, mom=0.9) on scored chunk, 1 epoch, all blocks unfrozen
```
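
A hedged sketch of the three-phase loop above; the chunking, the inference_mode scoring phase, and the SGD hyperparameters follow the record, while `score_chunk`, `update_ngram_tables`, and the model's loss interface are placeholders.

```python
import torch

def ttt_eval(model, chunks, mixer, lr=0.002, momentum=0.9):
    """Score-first test-time training over 32K-token chunks (sketch)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:                     # chunk: (1, 32768) token ids
        # Phase 1: score the chunk before any update that could benefit it.
        with torch.inference_mode():
            nll = score_chunk(model, mixer, chunk)   # placeholder: neural + Hedge mixer scoring
        total_nll += float(nll)
        total_tokens += chunk.numel()
        # Phase 2: the mixer's n-gram tables may now see these already-scored tokens.
        update_ngram_tables(mixer, chunk)            # placeholder
        # Phase 3: one epoch of SGD on the already-scored chunk, all blocks unfrozen.
        opt.zero_grad(set_to_none=True)
        loss = model(chunk[:, :-1], targets=chunk[:, 1:])  # assumes model returns a scalar loss
        loss.backward()
        opt.step()
    return total_nll / total_tokens
```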

## Compliance

- [x] Training: 600s wallclock on 8xH100 SXM
- [x] Eval (TTT): 507s on 8xH100 SXM
- [x] All artifacts under 16,000,000 bytes
- [x] Score-first TTT: tokens scored under inference_mode before training
- [x] N-gram tables from already-scored tokens only
- [x] No training data access during evaluation
- [x] No oracle/hindsight selection
- [x] GPTQ-lite: no calibration data

## Reproduction

```bash
SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All defaults match the submitted results. No env vars needed.

## Credits

PR #549 (@abaybektursun), #634 (@raahilshah), #657 (@anthony-maio), #638 (@Asukabot0), #693 (@EthanYangTW), #686 (@msisovic), #688 (@RoyiRa), #493 (@parinzee), #414 (@signalrush)