Summary
minimal_alphazero_loop.py computes Elo delta on promotion with a formula that assumes a 2-player symmetric matchup. For 3p/4p configs where staged evaluation pits 1 candidate vs (N-1) copies of best, the fair-WR null is 1/N (≈ 0.333 for 3p, 0.25 for 4p), not 0.5. The current formula gives negative Elo for promotions that squeak past the correctly-lowered 3p/4p thresholds.
Evidence
gh200-12 square8_3p iter 24 promoted at WR=0.343 (just above the ~0.33 fair baseline, so a tiny positive-strength signal). Tracked Elo:
iter 24 wr=0.343 dec=promote elo=1421.6 promos=2 ← Elo went DOWN on a promotion
Prior to iter 24, gh200-12's running Elo was 1534.9. After the "promotion" the formula charged −113.3 Elo, landing at 1421.6. A correct 3p-aware formula gives +7.1 Elo (at 0.343 WR, candidate is marginally stronger than best — small positive delta).
This hides genuine multiplayer improvement inside an Elo metric that looks like regression.
Root cause
ai-service/scripts/minimal_alphazero_loop.py promotion block:
eg = 400.0 * math.log10(wr / (1 - wr)) if 0 < wr < 1 else 0
elo += eg
This is the 2-player Elo transform: WR=0.5 ↦ 0 delta, WR=0.6 ↦ +70, WR=0.4 ↦ −70. It is correct for 2p configs where staged_evaluate is symmetric.
For N-player staged eval, the candidate plays 1 seat vs (N-1) best copies. Under equal skill, expected candidate WR = 1/N (by symmetry among the N players). So WR above 1/N indicates strength; WR at 1/N is even; WR below 1/N is weakness.
Proposed fix
# Number of players comes from the loop's NUM_PLAYERS (or args.num_players)
p_fair = 1.0 / NUM_PLAYERS
# Subtract the fair-baseline log-odds so WR = p_fair maps to 0 Elo delta
if 0 < wr < 1:
    eg = (400.0 / math.log(10)) * (
        math.log(wr / (1 - wr)) - math.log(p_fair / (1 - p_fair))
    )
else:
    eg = 0
elo += eg
Sanity checks on the new formula:
- 2p (p_fair=0.5): log(0.5/0.5)=0, so the formula reduces to the old one. Unchanged for 2p configs.
- 3p (p_fair=0.333): WR=0.333 → 0 delta; WR=0.5 → +120; WR=0.6 → +191; WR=0.25 → −70.
- 4p (p_fair=0.25): WR=0.25 → 0 delta; WR=0.5 → +191; WR=0.34 → +76.
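The sanity checks above can be reproduced with a small standalone helper. This is a sketch of the proposed formula only: the function name `elo_delta` and its signature are illustrative and are not taken from minimal_alphazero_loop.py, where the computation is inline in the promotion block.

```python
import math


def elo_delta(wr: float, num_players: int) -> float:
    """Elo delta for a candidate winning a fraction `wr` of games in which it
    holds 1 seat against (num_players - 1) copies of the current best.

    Hypothetical helper: maps the fair rate 1/num_players to a delta of 0.
    """
    if not 0.0 < wr < 1.0:
        return 0.0  # same guard as the existing code: degenerate WR adds nothing
    p_fair = 1.0 / num_players

    def log_odds10(p: float) -> float:
        return math.log10(p / (1.0 - p))

    return 400.0 * (log_odds10(wr) - log_odds10(p_fair))


if __name__ == "__main__":
    cases = [(2, 0.6), (3, 1 / 3), (3, 0.5), (3, 0.343), (4, 0.25), (4, 0.34)]
    for n, wr in cases:
        print(f"{n}p  WR={wr:.3f}  delta={elo_delta(wr, n):+.1f}")
```

Using base-10 log-odds directly is equivalent to the `400.0 / math.log(10)` scaling in the proposed fix; it is written this way only to make the relation to the old 2p formula easy to see.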
Impact
- Historical Elo numbers on 3p/4p configs are misleading downward. The square8_3p iter 24 promotion, tracked as 1534.9 → 1421.6, should be 1534.9 → ~1542 under corrected math.
- Plateau detector reads estimated_elo: a "plateau" on 3p may be the bug making real progress look flat or negative.
- Dashboards, RESULTS.md, fleet heartbeats all propagate these wrong numbers.
Recommendation on 3p promotion threshold (separate concern raised in session)
The 3p stage-4 promote threshold was lowered from 50.1% to 34.1% to account for the 1/3 fair baseline. Question: is 34.1% too lenient (about 0.8 percentage points above the fair rate), causing false promotions?
Recommendation: don't tighten yet. Two reasons:
- The Elo formula bug above is the more urgent fix. Once corrected Elo is in metrics, false promotions will show as small or negative corrected-Elo deltas — directly observable.
- Test the tightening decision empirically: run 3-5 more 3p iters with corrected Elo, observe whether the promotion-with-negative-corrected-Elo pattern recurs, then tighten if it does.
If corrected Elo repeatedly shows <+5 on promotions, the threshold margin above 1/3 is too thin and should be raised to something like 36-37% at stage 4. Decide with data.
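To relate candidate thresholds to corrected Elo, the formula can be inverted. The sketch below uses two hypothetical helpers, `elo_delta` and `wr_for_delta`, neither of which exists in the codebase. It assumes the delta is computed from the same stage-4 WR that gates promotion. Under that assumption a WR exactly at 34.1% maps to roughly +6 Elo in 3p, which may be worth keeping in mind when choosing the +5 cutoff.

```python
import math


def elo_delta(wr: float, num_players: int) -> float:
    """Corrected Elo delta (hypothetical helper): 0 at the fair rate 1/N."""
    if not 0.0 < wr < 1.0:
        return 0.0
    p_fair = 1.0 / num_players
    return 400.0 * (
        math.log10(wr / (1.0 - wr)) - math.log10(p_fair / (1.0 - p_fair))
    )


def wr_for_delta(delta: float, num_players: int) -> float:
    """Inverse of elo_delta: the win rate that yields `delta` Elo."""
    p_fair = 1.0 / num_players
    # Scale the fair odds by 10^(delta/400), then convert odds back to a rate.
    odds = (p_fair / (1.0 - p_fair)) * 10.0 ** (delta / 400.0)
    return odds / (1.0 + odds)


# Corrected delta implied by a WR sitting exactly on each candidate threshold.
for threshold in (0.341, 0.36, 0.37):
    print(f"3p threshold {threshold:.3f} -> {elo_delta(threshold, 3):+.1f} Elo")

# Win rate that corresponds to a +5 Elo delta in 3p.
print(f"3p WR for +5 Elo: {wr_for_delta(5.0, 3):.4f}")
```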
Files
- ai-service/scripts/minimal_alphazero_loop.py: promotion block (~line 1210). Ensure NUM_PLAYERS is accessible there (it already is via module global).
- Consider also recomputing elo at startup from the corrected formula applied to historical metrics.jsonl promotions. This prevents the old wrong values from persisting in the in-memory state after a restart.
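A startup recompute could look roughly like the sketch below. It is heavily assumption-laden. The record keys `wr` and `dec` are borrowed from the log line in the Evidence section, and whether metrics.jsonl actually uses those keys is not confirmed here. `replay_elo`, `elo_delta`, and the `start_elo` parameter are hypothetical names. The sketch also assumes Elo only changes on promoted iterations, as in the current promotion block.

```python
import json
import math
from typing import Iterable


def elo_delta(wr: float, num_players: int) -> float:
    """Corrected Elo delta (hypothetical helper): 0 at the fair rate 1/N."""
    if not 0.0 < wr < 1.0:
        return 0.0
    p_fair = 1.0 / num_players
    return 400.0 * (
        math.log10(wr / (1.0 - wr)) - math.log10(p_fair / (1.0 - p_fair))
    )


def replay_elo(lines: Iterable[str], num_players: int, start_elo: float) -> float:
    """Rebuild the running Elo from JSONL records, one JSON object per line.

    ASSUMPTION: each record has a float "wr" and a "dec" field equal to
    "promote" on promoted iterations; adjust to the real schema.
    """
    elo = start_elo
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("dec") == "promote":
            elo += elo_delta(float(record["wr"]), num_players)
    return elo


# Usage sketch (path and starting Elo are placeholders, not known values):
#   with open(metrics_path) as fh:
#       elo = replay_elo(fh, NUM_PLAYERS, start_elo=initial_elo)
```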
Tests
- Existing tests/unit/scripts/test_minimal_alphazero_loop.py should be extended with a parametrized test for each of 2p/3p/4p ensuring (a) the formula reduces to the old one for 2p, (b) WR = 1/N gives 0 delta, (c) sign of delta matches sign of (WR − 1/N).
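The three properties could be checked along these lines. This is a sketch using the standard-library unittest with subTest so it runs without extra dependencies; it ports directly to a pytest parametrized test if that is what the existing suite uses. `elo_delta` is a hypothetical stand-in for the inline promotion-block formula, which would need to be factored into a function (or otherwise exposed) to be testable.

```python
import math
import unittest


def elo_delta(wr: float, num_players: int) -> float:
    """Stand-in for the corrected formula (hypothetical helper name)."""
    if not 0.0 < wr < 1.0:
        return 0.0
    p_fair = 1.0 / num_players
    return 400.0 * (
        math.log10(wr / (1.0 - wr)) - math.log10(p_fair / (1.0 - p_fair))
    )


def old_elo_delta(wr: float) -> float:
    """The current 2-player transform, copied from the promotion block."""
    return 400.0 * math.log10(wr / (1 - wr)) if 0 < wr < 1 else 0.0


class EloDeltaTest(unittest.TestCase):
    def test_reduces_to_old_formula_for_2p(self):
        for wr in (0.1, 0.4, 0.5, 0.6, 0.9):
            with self.subTest(wr=wr):
                self.assertAlmostEqual(elo_delta(wr, 2), old_elo_delta(wr), places=9)

    def test_fair_rate_gives_zero_delta(self):
        for n in (2, 3, 4):
            with self.subTest(num_players=n):
                self.assertAlmostEqual(elo_delta(1.0 / n, n), 0.0, places=9)

    def test_sign_matches_wr_minus_fair_rate(self):
        # Win rates chosen so none coincides with 1/2, 1/3, or 1/4.
        for n in (2, 3, 4):
            for wr in (0.05, 0.2, 0.3, 0.4, 0.6, 0.95):
                with self.subTest(num_players=n, wr=wr):
                    self.assertEqual(elo_delta(wr, n) > 0, wr > 1.0 / n)


if __name__ == "__main__":
    unittest.main()
```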
Effort
~2h including tests + retro-compute correction for in-memory Elo on startup.