Summary
The A1 seat-fairness check in scripts/lib/model_quality_gate.py::_check_seat_fairness fires false positives on normal sampling noise at stage-1 sample sizes and uses an incorrect null hypothesis. The first production firing (gh200-13 iter 4 on hex8_4p) is provably noise: the same iter's selfplay block shows a larger seat asymmetry with zero model bias possible.
Evidence: first production firing was a false positive
From gh200-13 data/minimal_loop_hex8_4p/metrics.jsonl iter 4:
Evidence: the null hypothesis is wrong
The check compares the candidate's per-seat WR against uniform (expect 25% in each seat for 4p). But the same iter's selfplay block shows what the game itself does when all 4 seats are played by the same model:
Ratio: 30/17 = 1.76×, higher than the eval's 1.62×. All four "players" are identical, so there is zero possible model seat-bias. The 1.76× ratio is entirely the game's natural turn-order effect (first-mover disadvantage, likely from being targeted or having less information when placing).
χ² for selfplay vs uniform: 3.92, df=3, p ≈ 0.27 — also not significant, but the max/min ratio trips 1.5 easily.
Conclusion
The current check produces false positives because:
Threshold too tight — 1.5× ratio is below the natural variance of 4-player hex8 games at n≤100, let alone at stage-1 n=50 (12-13 per seat).
Wrong null — comparing to uniform (1/N) ignores turn-order game effects. The proper null is the game's natural per-seat distribution under identical-skill play, which is already available in the same metrics row as the selfplay block.
Proposed fix
Two changes to _check_seat_fairness in scripts/lib/model_quality_gate.py, plus the call-site plumbing they need:
1. Use selfplay per-seat distribution as the null (not uniform)
Extend the QualityGateTracker to accept an optional selfplay_seat_wins: dict[int, int] field (passed from minimal_alphazero_loop.py after selfplay completes, same iter). If provided, compute:
# For each seat, the game-natural expected win probability at equal skill:
selfplay_total = sum(selfplay_seat_wins.values())
expected_wr = {s: w / selfplay_total for s, w in selfplay_seat_wins.items()}
# Candidate's seat WR relative to that baseline:
normalized_wr = {s: seat_wr[s] / expected_wr[s] for s in seat_wr}
# Flag imbalance only if the normalized max/min ratio exceeds threshold
normalized_ratio = max(normalized_wr.values()) / min(normalized_wr.values())
If selfplay_seat_wins is absent (first iter, old data), fall back to uniform as today but emit only an INFO note, not a warning.
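A minimal sketch of step 1, including the uniform fallback, might look like the following. The function name normalized_seat_ratio and its return shape are assumptions for illustration, not the existing QualityGateTracker API. The sketch also falls back to uniform when any seat has zero selfplay wins, since the normalization would otherwise divide by zero.

```python
from typing import Dict, Optional, Tuple


def normalized_seat_ratio(
    seat_wr: Dict[int, float],
    selfplay_seat_wins: Optional[Dict[int, int]] = None,
) -> Tuple[float, str]:
    """Return (max/min ratio of baseline-normalized seat WR, baseline used).

    Hypothetical helper: the name and signature are illustrative only.
    """
    seats = sorted(seat_wr)
    total = sum(selfplay_seat_wins.values()) if selfplay_seat_wins else 0
    usable = total > 0 and all(selfplay_seat_wins.get(s, 0) > 0 for s in seats)
    if usable:
        # Game-natural expected win probability per seat at equal skill.
        expected_wr = {s: selfplay_seat_wins[s] / total for s in seats}
        baseline = "selfplay"
    else:
        # No usable selfplay data (first iter, old data): uniform null.
        expected_wr = {s: 1.0 / len(seats) for s in seats}
        baseline = "uniform"
    normalized = {s: seat_wr[s] / expected_wr[s] for s in seats}
    lowest = min(normalized.values())
    ratio = max(normalized.values()) / lowest if lowest > 0 else float("inf")
    return ratio, baseline


if __name__ == "__main__":
    # Heuristic baseline counts from this issue, used here as a stand-in null.
    null_wins = {1: 102, 2: 102, 3: 79, 4: 117}
    seat_wr = {1: 0.255, 2: 0.255, 3: 0.1975, 4: 0.2925}
    print(normalized_seat_ratio(seat_wr, null_wins))  # ratio near 1 vs selfplay null
    print(normalized_seat_ratio(seat_wr))             # same WR measured against uniform
```

A caller would treat the "uniform" baseline label as the cue to emit an INFO note rather than a warning.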
2. Raise minimum-games threshold + use chi-square test
Replace the max/min ratio gate with a chi-square test against the selfplay null:
from scipy.stats import chisquare

observed = [seat_wins[s] for s in sorted(seat_wins)]
expected_p = [expected_wr[s] for s in sorted(seat_wins)]
expected = [p * sum(observed) for p in expected_p]
chi2, p_value = chisquare(observed, f_exp=expected)
# Warn only if p < 0.05 AND min_games_per_seat >= 25 (for meaningful power)
if p_value < 0.05 and min(seat_games.values()) >= 25:
    warnings.append(
        f"SEAT_WR_IMBALANCE: chi2 {chi2:.2f} vs selfplay null, "
        f"p={p_value:.3f}, per-seat WR {seat_wr}"
    )
This requires ~100 games per iter (25 per seat) before firing, which means stage 2+ data only. Effectively silences the check at stage 1.
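For environments without scipy (for example a lightweight unit test), the same gate can be sketched with the standard library only. The chi2_sf helper below is a stand-in for the p-value that scipy.stats.chisquare returns, using the closed-form chi-square survival function for integer degrees of freedom. The function names, the alpha parameter, and the decision to stay silent when the selfplay null has an empty seat are assumptions, not existing code.

```python
import math
from typing import Dict, Optional


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    terms = sum(
        half ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


def seat_imbalance_warning(
    seat_wins: Dict[int, int],
    seat_games: Dict[int, int],
    selfplay_seat_wins: Dict[int, int],
    alpha: float = 0.05,
    min_games_per_seat: int = 25,
) -> Optional[str]:
    """Return a warning string, or None when the check stays silent."""
    if min(seat_games.values()) < min_games_per_seat:
        return None  # too few games per seat for meaningful power
    seats = sorted(seat_wins)
    selfplay_total = sum(selfplay_seat_wins.values())
    total_wins = sum(seat_wins.values())
    if total_wins == 0 or any(selfplay_seat_wins.get(s, 0) == 0 for s in seats):
        return None  # null distribution undefined; assumption: stay silent
    expected = [selfplay_seat_wins[s] / selfplay_total * total_wins for s in seats]
    chi2 = sum((seat_wins[s] - e) ** 2 / e for s, e in zip(seats, expected))
    p_value = chi2_sf(chi2, df=len(seats) - 1)
    if p_value < alpha:
        return (
            f"SEAT_WR_IMBALANCE: chi2 {chi2:.2f} vs selfplay null, "
            f"p={p_value:.3f}"
        )
    return None
```

Like the scipy version, this treats the selfplay distribution as an exact null and ignores its own sampling error. With a small selfplay block, a two-sample test on the eval and selfplay counts may be the more conservative choice.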
3. Update callsites
minimal_alphazero_loop.py staged_evaluate flow: pass sp["p1_wins"]...p4_wins into the tracker so the check can use it.
minimal_alphazero_loop.py quality-gate call: collect selfplay_seat_wins from the selfplay result and pass to check_model_quality.
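The plumbing could be as small as the helper below. It assumes the selfplay result is a dict with keys p1_wins through pN_wins, as the sp["p1_wins"] reference above suggests. The helper name collect_selfplay_seat_wins is hypothetical, and returning None on a missing key is an assumption chosen so the gate can fall back to the uniform null.

```python
from typing import Dict, Mapping, Optional


def collect_selfplay_seat_wins(
    sp: Mapping[str, int], num_players: int
) -> Optional[Dict[int, int]]:
    """Map seat number (1-based) to selfplay wins, or None if data is missing."""
    seat_wins: Dict[int, int] = {}
    for seat in range(1, num_players + 1):
        key = f"p{seat}_wins"
        if key not in sp:
            return None  # old metrics row: let the gate use the uniform fallback
        seat_wins[seat] = int(sp[key])
    return seat_wins


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(collect_selfplay_seat_wins(
        {"p1_wins": 3, "p2_wins": 4, "p3_wins": 2, "p4_wins": 5}, num_players=4
    ))
```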
Baseline empirical data
ai-service/scripts/experiments/seat_fairness_baseline.py --board-type hex8 --num-players 4 --num-games 400 --seed 42 — 4 identical HeuristicAI players (no NN, no model training signal, no seat-specific behavior possible):
seat 1: 102/400 = 25.5% (95% Wilson CI [0.215, 0.300])
seat 2: 102/400 = 25.5% (95% CI [0.215, 0.300])
seat 3: 79/400 = 19.8% (95% CI [0.161, 0.239])
seat 4: 117/400 = 29.2% (95% CI [0.250, 0.339])
max/min ratio: 1.481 (right at the 1.5 threshold)
chi² vs uniform: 7.38, df=3 (critical@5%=7.82, critical@10%=6.25)
p ≈ 0.06 — borderline, not 5%-significant
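The figures above can be re-derived from the four win counts alone. The sketch below recomputes the Wilson intervals, the max/min ratio, and the chi-square statistic against uniform. It uses z = 1.96 for the 95% level, and the helper name wilson_interval is illustrative.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion wins/n."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


seat_wins = {1: 102, 2: 102, 3: 79, 4: 117}  # counts from the baseline run
n_games = sum(seat_wins.values())

for seat, wins in seat_wins.items():
    lo, hi = wilson_interval(wins, n_games)
    print(f"seat {seat}: {wins}/{n_games} CI [{lo:.3f}, {hi:.3f}]")

ratio = max(seat_wins.values()) / min(seat_wins.values())
expected = n_games / len(seat_wins)  # uniform null: 100 wins per seat
chi2 = sum((w - expected) ** 2 / expected for w in seat_wins.values())
print(f"max/min ratio: {ratio:.3f}")
print(f"chi2 vs uniform: {chi2:.2f} (5% critical value for df=3 is 7.82)")
```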
Key observations:
At 400 games (8× the production stage-1 sample of 50), the ratio is 1.481, barely under 1.5. A stage-1 n=50/iter run will cross 1.5 regularly by chance alone.
The game itself does not have uniform seat fairness. seat 3 is the weakest (19.8%), seat 4 the strongest (29.2%) — in identical-skill heuristic play. The selfplay block from iter 4 (different strategy — trained NN) shows a different asymmetry (p1=17%, p4=30%), confirming that the "game-natural" per-seat distribution depends on the player's style. So the null must be the model's own selfplay distribution, not heuristic's, not uniform.
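The first observation can be checked with a quick Monte Carlo. This is a simplified model, not the production eval: each game has exactly one winner, drawn uniformly from the seats, so any max/min ratio above the threshold is pure sampling noise. The function name, trial count, and seed are arbitrary choices.

```python
import random


def ratio_trip_rate(
    n_games: int = 50,
    n_seats: int = 4,
    threshold: float = 1.5,
    trials: int = 5000,
    seed: int = 0,
) -> float:
    """Fraction of simulated runs whose max/min seat-win ratio exceeds threshold."""
    rng = random.Random(seed)
    trips = 0
    for _ in range(trials):
        wins = [0] * n_seats
        for _ in range(n_games):
            wins[rng.randrange(n_seats)] += 1  # perfectly fair seats
        # A seat with zero wins counts as an infinite ratio.
        if min(wins) == 0 or max(wins) / min(wins) > threshold:
            trips += 1
    return trips / trials


if __name__ == "__main__":
    print("n=50: ", ratio_trip_rate(n_games=50))
    print("n=400:", ratio_trip_rate(n_games=400))
```

Comparing the n=50 and n=400 rates shows how strongly the 1.5 ratio gate depends on sample size even when seats are exactly fair.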
Artifact saved at ai-service/data/seat_fairness_baseline/hex8_4p_400g_seed42.json.
Blast radius
Before this fix, the gate would fire on every 3p/4p iter at stage 1 regardless of actual model behavior, producing noise in downstream tools (refresh_experiment_status.extract_seat_fairness, dashboards) and potentially motivating unnecessary B3 seat-stratified-loss work.
Related
Planned issue B3 (seat-stratified value loss) should NOT be implemented until this fix lands and we have real data showing persistent seat bias beyond game-natural asymmetry.
Effort
~3h including tests. Safe — pure diagnostic gate, never critical, only warnings.
Files
ai-service/scripts/lib/model_quality_gate.py:304-371 (_check_seat_fairness)
ai-service/scripts/minimal_alphazero_loop.py (quality-gate call site, needs plumbing for selfplay_seat_wins)
ai-service/scripts/experiments/seat_fairness_baseline.py (new; empirical baseline script)