
Fix multiplayer Elo formula in minimal_alphazero_loop: 2p math applied to 3p/4p promotions #90

Description

@an0mium

Summary

minimal_alphazero_loop.py computes the Elo delta on promotion with a formula that assumes a 2-player symmetric matchup. For 3p/4p configs, where staged evaluation pits 1 candidate against (N-1) copies of best, the fair-WR null is 1/N (≈ 0.333 for 3p, 0.25 for 4p), not 0.5. The current formula therefore yields a negative Elo delta for promotions that squeak past the correctly lowered 3p/4p thresholds.

Evidence

gh200-12 square8_3p iter 24 promoted at WR=0.343 (just above the ~0.33 fair baseline, so a tiny positive-strength signal). Tracked Elo:

iter 24  wr=0.343  dec=promote  elo=1421.6  promos=2   ← Elo went DOWN on a promotion

Prior to iter 24, gh200-12's running Elo was 1534.9. After the "promotion" the formula charged −113.3 Elo, landing at 1421.6. A correct 3p-aware formula gives +7.1 Elo (at 0.343 WR, candidate is marginally stronger than best — small positive delta).

This hides genuine multiplayer improvement inside an Elo metric that looks like regression.
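The iter 24 numbers can be reproduced with a short sketch. It assumes the logged wr=0.343 is rounded to three decimals, so the results land near, not exactly on, the −113.3 and +7.1 quoted above; `log_odds_elo` is a helper name introduced here for illustration only.

```python
import math

NUM_PLAYERS = 3       # square8_3p
ELO_BEFORE = 1534.9   # running Elo before iter 24 (from this issue)
ELO_AFTER = 1421.6    # logged Elo after the promotion
WR_LOGGED = 0.343     # win rate as printed in the log line (rounded)


def log_odds_elo(p: float) -> float:
    """Elo-scaled log-odds: 400 * log10(p / (1 - p))."""
    return 400.0 * math.log10(p / (1.0 - p))


# What the current 2p transform charges for the logged win rate.
delta_2p = log_odds_elo(WR_LOGGED)

# The same win rate measured against the 1/N fair baseline.
delta_np = delta_2p - log_odds_elo(1.0 / NUM_PLAYERS)

# Cross-check that starts from the logged Elo values instead of the rounded WR.
logged_delta = ELO_AFTER - ELO_BEFORE
corrected_from_log = logged_delta - log_odds_elo(1.0 / NUM_PLAYERS)

print(f"2p transform at wr={WR_LOGGED}: {delta_2p:+.1f}")
print(f"baseline-corrected at wr={WR_LOGGED}: {delta_np:+.1f}")
print(f"logged delta {logged_delta:+.1f} -> corrected {corrected_from_log:+.1f}")
```

Both routes give a small positive delta for this promotion; the gap between them comes only from the rounding of the logged win rate.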

Root cause

ai-service/scripts/minimal_alphazero_loop.py promotion block:

eg = 400.0 * math.log10(wr / (1 - wr)) if 0 < wr < 1 else 0
elo += eg

This is the 2-player Elo transform: WR=0.5 ↦ 0 delta, WR=0.6 ↦ +70, WR=0.4 ↦ −70. It is correct for 2p configs where staged_evaluate is symmetric.

For N-player staged eval, the candidate plays 1 seat vs (N-1) best copies. Under equal skill, expected candidate WR = 1/N (by symmetry among the N players). So WR above 1/N indicates strength; WR at 1/N is even; WR below 1/N is weakness.
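The symmetry argument can be made concrete with a simple strength model. This is an assumption for illustration, not something the loop implements: each seat wins with probability proportional to a positive strength, best copies have strength 1, and the candidate has strength `ratio`. `candidate_win_prob` is a hypothetical helper.

```python
def candidate_win_prob(ratio: float, num_players: int) -> float:
    """Win probability of one candidate vs (num_players - 1) copies of best.

    Assumed model: every seat wins with probability proportional to its
    strength; best copies have strength 1, the candidate has `ratio`.
    """
    return ratio / (ratio + (num_players - 1))


for n in (2, 3, 4):
    even = candidate_win_prob(1.0, n)      # equal skill gives 1/N
    stronger = candidate_win_prob(1.2, n)  # candidate 20% stronger than best
    print(f"{n}p: even={even:.3f} stronger={stronger:.3f}")
```

Under this model an equally skilled candidate lands exactly on 1/N, and a 3p candidate can be stronger than best while still winning well under half its games.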

Proposed fix

# Number of players comes from the loop's NUM_PLAYERS (or args.num_players)
p_fair = 1.0 / NUM_PLAYERS
# Subtract the fair-baseline log-odds so WR = p_fair maps to 0 Elo delta
if 0 < wr < 1:
    eg = (400.0 / math.log(10)) * (
        math.log(wr / (1 - wr)) - math.log(p_fair / (1 - p_fair))
    )
else:
    eg = 0
elo += eg
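The baseline term in the fix simplifies to a constant per player count. Writing w for wr, p for p_fair, N for NUM_PLAYERS, and Δ for eg (symbols chosen here for brevity):

```latex
\Delta = \frac{400}{\ln 10}\left[\ln\frac{w}{1-w} - \ln\frac{p}{1-p}\right],
\qquad p = \frac{1}{N}
\;\Rightarrow\; \frac{p}{1-p} = \frac{1}{N-1}

\Delta = 400\,\log_{10}\frac{w}{1-w} + 400\,\log_{10}(N-1)
```

So the corrected delta is the old 2p delta plus a fixed offset of 400·log10(N−1): 0 for 2p, about 120.4 for 3p, about 190.8 for 4p.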

Sanity checks on the new formula:

  • 2p (p_fair=0.5): log(0.5/0.5)=0, formula reduces to the old one. Unchanged for 2p configs.
  • 3p (p_fair=0.333): WR=0.333 → 0 delta; WR=0.5 → ≈ +120; WR=0.6 → ≈ +191; WR=0.25 → ≈ −70.
  • 4p (p_fair=0.25): WR=0.25 → 0 delta; WR=0.5 → ≈ +191; WR=0.34 → ≈ +76.
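The sanity checks can be run directly. The sketch below wraps the proposed formula in a standalone function (`elo_delta` and `CASES` are names introduced here, not identifiers from the loop) and prints the delta for each win rate listed in the bullets.

```python
import math


def elo_delta(wr: float, num_players: int) -> float:
    """Elo delta for a candidate win rate, zeroed at the 1/N fair baseline."""
    if not 0.0 < wr < 1.0:
        return 0.0  # same guard as the proposed fix: skip degenerate win rates
    p_fair = 1.0 / num_players
    return (400.0 / math.log(10)) * (
        math.log(wr / (1.0 - wr)) - math.log(p_fair / (1.0 - p_fair))
    )


# Win rates taken from the sanity-check bullets, keyed by player count.
CASES = {
    2: (0.4, 0.5, 0.6),
    3: (0.25, 1.0 / 3.0, 0.5, 0.6),
    4: (0.25, 0.34, 0.5),
}

for n, win_rates in CASES.items():
    for wr in win_rates:
        print(f"{n}p  wr={wr:.3f}  delta={elo_delta(wr, n):+.1f}")
```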

Impact

  • Historical Elo numbers on 3p/4p configs are misleading downward. The square8_3p iter 24 promotion, tracked as 1534.9 → 1421.6, should be 1534.9 → ~1542 under corrected math.
  • Plateau detector reads estimated_elo — a "plateau" on 3p may be the bug making real progress look flat or negative.
  • Dashboards, RESULTS.md, fleet heartbeats all propagate these wrong numbers.

Recommendation on 3p promotion threshold (separate concern raised in session)

The 3p stage-4 promote threshold was lowered from 50.1% to 34.1% to account for the 1/3 fair baseline. Question: is 34.1% too lenient (under one percentage point above the fair baseline), causing false promotions?

Recommendation: don't tighten yet. Two reasons:

  1. The Elo formula bug above is the more urgent fix. Once corrected Elo is in metrics, false promotions will show as small or negative corrected-Elo deltas — directly observable.
  2. Test the tightening decision empirically: run 3-5 more 3p iters with corrected Elo, observe whether the promotion-with-negative-corrected-Elo pattern recurs, then tighten if it does.

If corrected Elo repeatedly shows <+5 on promotions, the threshold margin above 1/3 is too thin and should be raised to something like 36-37% at stage 4. Decide with data.
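To put candidate thresholds on the same scale as the corrected Elo deltas, a small sketch can map a threshold win rate to the delta of a promotion that lands exactly on it. `elo_margin_at_threshold` is a hypothetical helper; the 36% and 37% values are the ones floated above.

```python
import math


def elo_margin_at_threshold(threshold_wr: float, num_players: int) -> float:
    """Corrected Elo delta of a promotion that lands exactly on the threshold."""
    # Baseline offset: -400*log10(p_fair/(1-p_fair)) with p_fair = 1/N.
    offset = 400.0 * math.log10(num_players - 1)
    return 400.0 * math.log10(threshold_wr / (1.0 - threshold_wr)) + offset


for threshold in (0.341, 0.36, 0.37):
    margin = elo_margin_at_threshold(threshold, num_players=3)
    print(f"3p threshold {threshold:.1%}: minimum corrected delta {margin:+.1f}")
```

Under the corrected formula, a 3p promotion at exactly 34.1% corresponds to roughly +6 Elo, which gives a floor to compare observed promotion deltas against.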

Files

  • ai-service/scripts/minimal_alphazero_loop.py — promotion block (~line 1210) + ensure NUM_PLAYERS is accessible there (it already is via module global).
  • Consider also recomputing elo at startup from the corrected formula applied to historical metrics.jsonl promotions — prevents the old wrong values from persisting in the in-memory state after a restart.
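A minimal sketch of the startup recompute, under stated assumptions: each metrics.jsonl line is a JSON object with keys "wr" and "dec" (names borrowed from the log line in Evidence; the real schema may differ), only promotions change Elo, and the starting Elo is supplied by the caller because this issue does not state it. `replay_elo` and `corrected_delta` are hypothetical names.

```python
import json
import math
from typing import Iterable


def corrected_delta(wr: float, num_players: int) -> float:
    """Elo delta zeroed at the 1/N fair baseline (0 for degenerate win rates)."""
    if not 0.0 < wr < 1.0:
        return 0.0
    return 400.0 * (math.log10(wr / (1.0 - wr)) + math.log10(num_players - 1))


def replay_elo(lines: Iterable[str], num_players: int, start_elo: float) -> float:
    """Rebuild the running Elo by replaying promotion records in order.

    Assumes one JSON object per line with hypothetical keys "wr" and "dec".
    """
    elo = start_elo
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("dec") == "promote":
            elo += corrected_delta(float(record["wr"]), num_players)
    return elo


# Illustrative records only: the first is made up, the second mirrors iter 24.
history = [
    '{"iter": 23, "wr": 0.310, "dec": "reject"}',
    '{"iter": 24, "wr": 0.343, "dec": "promote"}',
]
print(f"{replay_elo(history, num_players=3, start_elo=1534.9):.1f}")
```

A real caller would pass an open file handle for metrics.jsonl instead of the in-memory list.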

Tests

  • The existing tests/unit/scripts/test_minimal_alphazero_loop.py should be extended with a parametrized test for each of 2p/3p/4p, ensuring (a) the formula reduces to the old one for 2p, (b) WR = 1/N gives 0 delta, (c) the sign of the delta matches the sign of (WR − 1/N).
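One possible shape for these checks, as a sketch only. It uses the standard-library unittest with subTest so it runs without extra dependencies; pytest.mark.parametrize would be the natural equivalent if the existing test file uses pytest. `elo_delta` and `elo_delta_2p` are stand-ins defined here, since the loop does not currently expose the formula as a function.

```python
import math
import unittest


def elo_delta(wr: float, num_players: int) -> float:
    """Stand-in for the corrected promotion formula from the proposed fix."""
    if not 0.0 < wr < 1.0:
        return 0.0
    p_fair = 1.0 / num_players
    return (400.0 / math.log(10)) * (
        math.log(wr / (1.0 - wr)) - math.log(p_fair / (1.0 - p_fair))
    )


def elo_delta_2p(wr: float) -> float:
    """The existing 2-player transform, kept for the regression check."""
    return 400.0 * math.log10(wr / (1.0 - wr)) if 0.0 < wr < 1.0 else 0.0


class EloDeltaTest(unittest.TestCase):
    WIN_RATES = (0.1, 0.25, 0.34, 0.5, 0.6, 0.9)

    def test_reduces_to_2p_formula(self):
        # (a) with two players the corrected formula matches the old one
        for wr in self.WIN_RATES:
            with self.subTest(wr=wr):
                self.assertAlmostEqual(elo_delta(wr, 2), elo_delta_2p(wr), places=9)

    def test_fair_win_rate_gives_zero(self):
        # (b) WR = 1/N maps to a zero delta
        for n in (2, 3, 4):
            with self.subTest(num_players=n):
                self.assertAlmostEqual(elo_delta(1.0 / n, n), 0.0, places=9)

    def test_sign_matches_distance_from_fair(self):
        # (c) the delta is positive exactly when WR exceeds 1/N
        for n in (2, 3, 4):
            for wr in self.WIN_RATES:
                if math.isclose(wr, 1.0 / n):
                    continue  # covered by the zero-delta test
                with self.subTest(num_players=n, wr=wr):
                    self.assertEqual(elo_delta(wr, n) > 0, wr > 1.0 / n)
```

Saved as a test module, this can be run with `python -m unittest`.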

Effort

~2h including tests + retro-compute correction for in-memory Elo on startup.
