Fix multiplayer Elo formula in minimal_alphazero_loop: 2p math applied to 3p/4p promotions

## Summary

`minimal_alphazero_loop.py` computes the Elo delta on promotion with a formula that assumes a symmetric 2-player matchup. For 3p/4p configs, where staged evaluation pits 1 candidate against (N-1) copies of best, the win rate (WR) expected under equal strength is 1/N (≈ 0.333 for 3p, 0.25 for 4p), not 0.5. The current formula therefore gives **negative Elo** for promotions that squeak past the correctly lowered 3p/4p thresholds.

## Evidence

gh200-12 `square8_3p` iter 24 promoted at WR=0.343 (just above the ~0.33 fair baseline, so a tiny positive-strength signal). Tracked Elo:

```
iter 24  wr=0.343  dec=promote  elo=1421.6  promos=2   ← Elo went DOWN on a promotion
```

Prior to iter 24, gh200-12's running Elo was 1534.9. After the "promotion" the formula charged −113.3 Elo, landing at 1421.6. A correct 3p-aware formula gives +7.1 Elo (at 0.343 WR, candidate is marginally stronger than best — small positive delta).

This hides genuine multiplayer improvement inside an Elo metric that looks like regression.

## Root cause

`ai-service/scripts/minimal_alphazero_loop.py` promotion block:

```python
eg = 400.0 * math.log10(wr / (1 - wr)) if 0 < wr < 1 else 0
elo += eg
```

This is the 2-player Elo transform: WR=0.5 ↦ 0 delta, WR=0.6 ↦ +70, WR=0.4 ↦ −70. It is correct for 2p configs where staged_evaluate is symmetric.

For N-player staged eval, the candidate plays 1 seat against (N-1) copies of best. Under equal skill, the expected candidate WR is 1/N (by symmetry among the N players). So WR above 1/N indicates strength, WR at 1/N is even, and WR below 1/N indicates weakness.
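To make the size of the bias concrete, here is a minimal sketch that feeds the fair win rate 1/N into the current 2-player transform. `old_elo_delta` is a hypothetical wrapper around the one-liner quoted above, not a function that exists in the script. An exactly even candidate is charged roughly −120 Elo in 3p and −191 Elo in 4p.

```python
import math


def old_elo_delta(wr: float) -> float:
    """2-player transform as quoted from the promotion block (hypothetical wrapper)."""
    return 400.0 * math.log10(wr / (1 - wr)) if 0 < wr < 1 else 0.0


# Delta the current formula assigns to a candidate that is exactly as strong
# as best, i.e. one that wins at the fair rate 1/N.
for n in (2, 3, 4):
    fair_wr = 1.0 / n
    print(f"{n}p: fair WR={fair_wr:.3f} -> old delta={old_elo_delta(fair_wr):+.1f}")
```

The offset equals `-400 * log10(N - 1)`, which is the constant the proposed fix subtracts out.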

## Proposed fix

```python
# Number of players comes from the loop's NUM_PLAYERS (or args.num_players)
p_fair = 1.0 / NUM_PLAYERS
# Subtract the fair-baseline log-odds so WR = p_fair maps to 0 Elo delta
if 0 < wr < 1:
    eg = (400.0 / math.log(10)) * (
        math.log(wr / (1 - wr)) - math.log(p_fair / (1 - p_fair))
    )
else:
    eg = 0
elo += eg
```

Sanity checks on the new formula:

- 2p (`p_fair=0.5`): `log(0.5/0.5)=0`, formula reduces to the old one. Unchanged for 2p configs.
- 3p (`p_fair=0.333`): WR=0.333 → 0 delta; WR=0.5 → +120; WR=0.6 → +191; WR=0.25 → −70.
- 4p (`p_fair=0.25`): WR=0.25 → 0 delta; WR=0.5 → +191; WR=0.34 → +76.
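The sanity checks can be reproduced with a small standalone sketch. `elo_delta` is a hypothetical helper name; it wraps the proposed formula and takes the player count as an argument instead of reading `NUM_PLAYERS`. The last row uses the rounded WR from the iter 24 log line and gives about +7.5, close to the +7.1 quoted in Evidence (which presumably came from the unrounded WR).

```python
import math


def elo_delta(wr: float, num_players: int) -> float:
    """Elo delta for 1 candidate vs (num_players - 1) copies of best.

    Hypothetical helper: subtracts the log-odds of the fair win rate 1/N,
    so WR = 1/N maps to 0. Degenerate WR (0 or 1) yields 0, as in the fix.
    """
    if not 0 < wr < 1:
        return 0.0
    p_fair = 1.0 / num_players
    logit = math.log10(wr / (1 - wr))
    logit_fair = math.log10(p_fair / (1 - p_fair))
    return 400.0 * (logit - logit_fair)


cases = [(2, 0.6), (3, 0.5), (3, 0.6), (3, 0.25), (4, 0.5), (4, 0.34), (3, 0.343)]
for n, wr in cases:
    print(f"{n}p  wr={wr:.3f}  delta={elo_delta(wr, n):+.1f}")
```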

## Impact

- **Historical Elo numbers on 3p/4p configs are misleading downward.** The square8_3p iter 24 promotion, tracked as 1534.9 → 1421.6, should be 1534.9 → ~1542 under corrected math.
- **Plateau detector** reads `estimated_elo` — a "plateau" on 3p may be the bug making real progress look flat or negative.
- **Dashboards, RESULTS.md, fleet heartbeats** all propagate these wrong numbers.

## Recommendation on 3p promotion threshold (separate concern raised in session)

The 3p stage-4 promote threshold was lowered from 50.1% to 34.1% to account for the 1/3 fair baseline. Question: is 34.1% too lenient (only ≈0.8 percentage points above the 33.3% fair baseline), causing false promotions?

Recommendation: **don't tighten yet.** Two reasons:

1. The Elo formula bug above is the more urgent fix. Once corrected Elo is in metrics, false promotions will show as small or negative corrected-Elo deltas — directly observable.
2. Test the tightening decision empirically: run 3-5 more 3p iters with corrected Elo, observe whether the promotion-with-negative-corrected-Elo pattern recurs, then tighten if it does.

If corrected Elo repeatedly shows <+5 on promotions, the threshold margin above 1/3 is too thin and should be raised to something like 36-37% at stage 4. Decide with data.

## Files

- `ai-service/scripts/minimal_alphazero_loop.py` — promotion block (~line 1210) + ensure `NUM_PLAYERS` is accessible there (it already is via module global).
- Consider also recomputing `elo` at startup from the corrected formula applied to historical metrics.jsonl promotions — prevents the old wrong values from persisting in the in-memory state after a restart.
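
A rough sketch of the startup recompute, under loudly stated assumptions: the record keys `wr` and `dec` (with value `"promote"`) are guessed from the log line in Evidence and may not match the real metrics.jsonl schema, `replay_elo` and `elo_delta` are hypothetical helper names, and the starting Elo has to be supplied by the caller because it is not stated in this issue.

```python
import json
import math
from typing import Iterable


def elo_delta(wr: float, num_players: int) -> float:
    """Corrected delta: log-odds of WR minus log-odds of the fair rate 1/N."""
    if not 0 < wr < 1:
        return 0.0
    p_fair = 1.0 / num_players
    return 400.0 * (math.log10(wr / (1 - wr)) - math.log10(p_fair / (1 - p_fair)))


def replay_elo(lines: Iterable[str], num_players: int, start_elo: float) -> float:
    """Rebuild the running Elo by replaying promotions with the corrected formula.

    ASSUMPTION: each non-empty line is a JSON object with a "wr" float and a
    "dec" string; only records with dec == "promote" change the Elo.
    """
    elo = start_elo
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if rec.get("dec") == "promote":
            elo += elo_delta(float(rec["wr"]), num_players)
    return elo


# Usage with in-memory records; a real caller would pass an open file handle.
sample = ['{"wr": 0.5, "dec": "promote"}', '{"wr": 0.2, "dec": "reject"}']
print(replay_elo(sample, num_players=3, start_elo=1500.0))
```

Replaying from the log rather than patching the stored `elo` value keeps the restart path and the live path on the same formula.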

## Tests

- The existing `tests/unit/scripts/test_minimal_alphazero_loop.py` should be extended with a parametrized test over 2p/3p/4p ensuring that (a) the formula reduces to the old one for 2p, (b) WR = 1/N gives 0 delta, and (c) the sign of the delta matches the sign of (WR − 1/N).
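
A minimal sketch of properties (a) to (c), written with plain asserts so it runs standalone; in the real test file the loops would likely become `pytest.mark.parametrize` cases. `old_delta` and `new_delta` are hypothetical stand-ins for the current and proposed formulas, since the promotion block does not expose them as functions today.

```python
import math


def old_delta(wr: float) -> float:
    """Current 2-player transform."""
    return 400.0 * math.log10(wr / (1 - wr))


def new_delta(wr: float, n: int) -> float:
    """Proposed N-player transform, copied from the fix above."""
    p_fair = 1.0 / n
    return (400.0 / math.log(10)) * (
        math.log(wr / (1 - wr)) - math.log(p_fair / (1 - p_fair))
    )


def check_properties() -> None:
    # Win rates chosen so none coincides with 1/2, 1/3 or 1/4.
    win_rates = [0.05, 0.2, 0.3, 0.4, 0.6, 0.9]
    # (a) For 2p the new formula matches the old one.
    for wr in win_rates:
        assert math.isclose(new_delta(wr, 2), old_delta(wr), abs_tol=1e-9)
    for n in (2, 3, 4):
        # (b) The fair win rate maps to zero delta.
        assert math.isclose(new_delta(1.0 / n, n), 0.0, abs_tol=1e-9)
        # (c) The delta has the same sign as (WR - 1/N).
        for wr in win_rates:
            assert (new_delta(wr, n) > 0) == (wr > 1.0 / n)


check_properties()
```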

## Effort

~2h including tests + retro-compute correction for in-memory Elo on startup.

