Commit 4df68ee (parent 226d817)

Record: XSA-all + Depth Recurrence + Hedge Mixer TTT (val_bpb=1.0222, 3-seed mean)
Training: 600s, Eval: 507s, both within limits. 3 seeds: 1.0201, 1.0165, 1.0299 (mean 1.0222, std 0.0067)

6 files changed, 3069 additions, 0 deletions
# Record: XSA-all + VRL + CROWN-Q + Depth Recurrence + Hedge Mixer TTT

**val_bpb = 1.0222** (3-seed mean, std 0.0067) | **<16 MB** | 8xH100 SXM | 600s train, 507s eval
## 3-Seed Results (8xH100 80GB SXM, PyTorch 2.4.0+cu124)

| Seed | Steps | step_avg | Pre-TTT bpb | **Post-TTT bpb** | TTT time | Artifact (bytes) |
|------|-------|----------|-------------|------------------|----------|------------------|
| 1337 | 4,473 | 134.2ms | 1.1336 | **1.0201** | 507s | 15,857,972 |
| 42 | 4,452 | 134.8ms | 1.1339 | **1.0165** | 508s | 15,846,228 |
| 2025 | 4,451 | 134.8ms | 1.1369 | **1.0299** | 507s | 15,669,888 |
| **Mean** | | | **1.1348** | **1.0222 (std 0.0067)** | **507s** | |

All artifacts are under 16,000,000 bytes. Training: 600s. Eval (TTT + sliding): 507s. Both within limits.
## Architecture: PR #549 base + 6 additions
### 1. XSA on all layers (PR #634)

Exclusive Self-Attention on all 13 virtual layers (11 physical + 2 recurred). -0.006 BPB vs XSA-last-4.
### 2. Value Residual Learning (PR #657, arXiv:2410.17897)

Layer 0's V output blended into subsequent attention via learned sigmoid gates. +10 params.
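A minimal sketch of the blend, assuming the common VRL form `v' = g*v_layer + (1-g)*v0` with one learned scalar gate per layer (the exact parameterization in PR #657 may differ):

```python
import math

def vrl_blend(v_layer, v0, gate_logit):
    """Blend layer-0 values into the current layer's values through a
    learned sigmoid gate (one scalar logit per layer in this sketch)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid gate in (0, 1)
    return [g * a + (1.0 - g) * b for a, b in zip(v_layer, v0)]
```

With `gate_logit = 0` the gate is 0.5 and the output is the elementwise average of the two value streams.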
### 3. Gated Attention (PR #638)

Per-head sigmoid gates on attention output.
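A sketch of per-head output gating, assuming one learned scalar logit per head (gate placement and parameter shape are assumptions, not taken from PR #638):

```python
import math

def gate_attention_heads(head_outputs, gate_logits):
    """Scale each head's attention output by its own sigmoid gate."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    return [[sigmoid(g) * v for v in head]
            for head, g in zip(head_outputs, gate_logits)]
```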
### 4. CROWN-Q (PR #693)

Curvature-weighted quantization penalty during warmdown: `lambda * mean(w^2) * (row_max/15)^2 / 12`. Pushes weights into flat minima for better int6 quantization. Zero eval cost.
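The penalty can be read as weight energy times the variance of uniform quantization noise (step `row_max/15`, noise variance `step^2/12`). A per-row sketch, assuming the penalty is averaged over rows (the reduction used in PR #693 is not stated here):

```python
def crown_q_penalty(weight_rows, lam):
    """lambda * mean(w^2) * (row_max/15)^2 / 12, averaged over rows."""
    total = 0.0
    for row in weight_rows:
        row_max = max(abs(w) for w in row)            # per-row absmax
        step = row_max / 15.0                         # int6 quantization step
        mean_sq = sum(w * w for w in row) / len(row)  # weight energy
        total += lam * mean_sq * step * step / 12.0
    return total / len(weight_rows)
```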
### 5. Depth Recurrence (PR #686)

Layers 4,5 re-executed: 11 physical layers become 13 virtual (pattern: 0,1,2,3,4,5,4,5,6,7,8,9,10). Banks indexed via v2p mapping. Untied before TTT.
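The virtual-to-physical mapping can be sketched directly from the pattern above (`run_blocks` and `apply_block` are hypothetical names, not the repo's):

```python
# 13 virtual layers over 11 physical blocks: layers 4 and 5 run twice.
V2P = [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]

def run_blocks(x, blocks, apply_block):
    """Execute the 13 virtual layers, sharing weights where V2P repeats."""
    for p in V2P:
        x = apply_block(blocks[p], x)
    return x
```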
### 6. 5-Expert Hedge Mixer (PR #688)

GPU-vectorized online context mixing during TTT eval:

| Expert | Source |
|--------|--------|
| Neural | Base model log-softmax |
| Unigram | Token frequency from scored tokens |
| Bigram | P(next \| prev) from scored tokens |
| Trigram | Hashed 64K-bucket trigram table |
| Entropy | Neural entropy as confidence weight |

Weights updated via the Hedge algorithm. All n-gram tables are built from already-scored tokens only.
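A minimal sketch of the Hedge update over the expert weights, assuming per-token log loss and a hypothetical learning rate `eta` (neither is taken from PR #688):

```python
import math

def hedge_mix(expert_probs, weights):
    """Mixture probability the experts assign to the realized next token."""
    return sum(w * p for w, p in zip(weights, expert_probs))

def hedge_update(weights, expert_probs, eta=0.1):
    """Multiplicative-weights (Hedge) update: experts with lower log loss
    on the just-scored token gain weight; weights renormalize to sum 1."""
    losses = [-math.log(max(p, 1e-12)) for p in expert_probs]
    scaled = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(scaled)
    return [w / z for w in scaled]
```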
### Training Stack

11L physical (13 virtual), 512d, 8H/4KV GQA, MLP 3x LeakyReLU(0.5)^2, SmearGate, BigramHash(2048), VE128, EMA(0.997) + SWA, GPTQ-lite int6 + lzma, Muon WD=0.04, warmdown=3500.
### Legal Score-First TTT (1 epoch)

```
for each 32K-token chunk:
    Phase 1: SCORE under torch.inference_mode() + Hedge Mixer scoring
    Phase 2: UPDATE mixer n-gram tables with scored tokens
    Phase 3: TRAIN SGD(lr=0.002, mom=0.9) on scored chunk, 1 epoch, all blocks unfrozen
```
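The three phases can be sketched as a loop whose ordering enforces the score-first rule; `score_fn`, `update_tables_fn`, and `train_fn` are hypothetical stand-ins for the actual model, mixer, and optimizer calls:

```python
def ttt_eval(chunks, score_fn, update_tables_fn, train_fn):
    """Every chunk is fully scored (Phase 1) before the mixer tables
    (Phase 2) or the model weights (Phase 3) ever see it."""
    losses = []
    for chunk in chunks:
        losses.extend(score_fn(chunk))  # Phase 1: inference-mode scoring
        update_tables_fn(chunk)         # Phase 2: n-gram table update
        train_fn(chunk)                 # Phase 3: one SGD epoch
    return sum(losses) / len(losses)    # mean loss over all scored tokens
```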
## Compliance

- [x] Training: 600s wallclock on 8xH100 SXM
- [x] Eval (TTT): 507s on 8xH100 SXM
- [x] All artifacts under 16,000,000 bytes
- [x] Score-first TTT: tokens scored under inference_mode before training
- [x] N-gram tables from already-scored tokens only
- [x] No training data access during evaluation
- [x] No oracle/hindsight selection
- [x] GPTQ-lite: no calibration data
## Reproduction

```bash
SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All defaults match the submitted results. No env vars needed beyond `SEED`.
## Credits

PR #549 (@abaybektursun), #634 (@raahilshah), #657 (@anthony-maio), #638 (@Asukabot0), #693 (@EthanYangTW), #686 (@msisovic), #688 (@RoyiRa), #493 (@parinzee), #414 (@signalrush)
{
  "name": "Saken Tukenov",
  "github_id": "stukenov",
  "val_bpb": 1.0222,
  "val_bpb_std": 0.0067,
  "seeds": [1337, 42, 2025],
  "seed_bpbs": [1.0201, 1.0165, 1.0299],
  "artifact_bytes_max": 15857972,
  "train_time_seconds": 600,
  "eval_time_seconds": 507,
  "gpu": "8xH100 SXM 80GB",
  "framework": "PyTorch 2.4.0",
  "date": "2026-03-25"
}
