Fix A1 seat-fairness check: false positive on normal sampling noise (empirical baseline) #89

Description

@an0mium

Summary

The A1 seat-fairness check in scripts/lib/model_quality_gate.py::_check_seat_fairness fires false positives on normal sampling noise at stage-1 sample sizes and uses an incorrect null hypothesis. The first production firing (gh200-13 iter 4 on hex8_4p) is statistically indistinguishable from noise: the same iter's selfplay block shows a larger seat asymmetry even though no model seat bias is possible there.

Evidence: first production firing was a false positive

From gh200-13 data/minimal_loop_hex8_4p/metrics.jsonl iter 4:

"evaluation": {
    "candidate_wins": 11, "best_wins": 39,
    "win_rate": 0.22, "games_played": 50,
    "decision": "reject", "decision_stage": 1
},
"quality_gate": {
    "warnings": [
        "SEAT_WR_IMBALANCE: max/min per-seat WR ratio 1.62 > 1.5
         (seat1=15% (2/13), seat2=23% (3/13), seat3=25% (3/12), seat4=25% (3/12))"
    ],
    "details": {
        "seat_fairness": {
            "seat_wr": {"1": 0.154, "2": 0.231, "3": 0.25, "4": 0.25},
            "seat_games": {"1": 13, "2": 13, "3": 12, "4": 12},
            "wr_ratio": 1.625
        }
    }
}

Statistical tests on this data:

  • Chi² across 4 seats vs uniform: χ² = 0.47, df=3, p ≈ 0.93 — not remotely significant
  • seat1 vs seat4 two-proportion z-test: z = 0.60 (need |z|>1.96) — not significant
  • 95% Wilson CIs: seat1 [0.04, 0.42], seat4 [0.09, 0.53]; the intervals overlap almost entirely
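
The three figures above can be reproduced with the standard library alone. The sketch below assumes the chi-square is a test of homogeneity on the 4×2 wins/losses table (that reading matches the reported 0.47) and uses a closed-form survival function that holds only for df=3. The helper names are illustrative and are not part of the repository.

```python
import math

# Per-seat candidate results from the iter-4 evaluation block.
seat_wins = {1: 2, 2: 3, 3: 3, 4: 3}
seat_games = {1: 13, 2: 13, 3: 12, 4: 12}


def chi2_sf_df3(x: float) -> float:
    """Survival function of the chi-square distribution, valid for df=3 only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def homogeneity_chi2(wins: dict, games: dict) -> float:
    """Pearson chi-square on the seats x (win, loss) contingency table."""
    pooled = sum(wins.values()) / sum(games.values())
    stat = 0.0
    for s in wins:
        exp_w = games[s] * pooled
        exp_l = games[s] * (1 - pooled)
        stat += (wins[s] - exp_w) ** 2 / exp_w
        stat += ((games[s] - wins[s]) - exp_l) ** 2 / exp_l
    return stat


def two_proportion_z(w1: int, n1: int, w2: int, n2: int) -> float:
    """Pooled two-proportion z statistic for rate 2 minus rate 1."""
    pooled = (w1 + w2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (w2 / n2 - w1 / n1) / se


def wilson_ci(w: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion."""
    p = w / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


chi2 = homogeneity_chi2(seat_wins, seat_games)
p_value = chi2_sf_df3(chi2)
z = two_proportion_z(2, 13, 3, 12)  # seat1 vs seat4
ci_seat1 = wilson_ci(2, 13)
ci_seat4 = wilson_ci(3, 12)
print(f"chi2={chi2:.2f} p={p_value:.2f} z={z:.2f}")
print(f"seat1 CI=({ci_seat1[0]:.2f}, {ci_seat1[1]:.2f})")
print(f"seat4 CI=({ci_seat4[0]:.2f}, {ci_seat4[1]:.2f})")
```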

Evidence: the null hypothesis is wrong

The check compares candidate per-seat WR against uniform (expect 25% in each seat for 4p). But the SAME iter's selfplay block shows what the game itself does when all 4 seats are played by the same model:

"selfplay": {
    "completed": 100,
    "p1_wins": 17, "p2_wins": 25, "p3_wins": 28, "p4_wins": 30
}

Ratio: 30/17 = 1.76×, higher than the eval's 1.62×. All four "players" are identical, so there is zero possible model seat-bias. The 1.76× ratio is entirely the game's natural turn-order effect (first-mover disadvantage, likely from being targeted or having less information when placing).

χ² for selfplay vs uniform: 3.92, df=3, p ≈ 0.27 — also not significant, but the max/min ratio trips 1.5 easily.

Conclusion

The current check produces false positives because:

  1. Threshold too tight: a 1.5× max/min ratio is within the normal sampling variation of 4-player hex8 games at n≤100, let alone at stage-1 n=50 (12-13 games per seat).
  2. Wrong null: comparing to uniform (1/N) ignores turn-order game effects. The proper null is the game's natural per-seat distribution under identical-skill play, which is already available in the same metrics row as the selfplay block.

Proposed fix

Two changes to _check_seat_fairness in scripts/lib/model_quality_gate.py, plus the call-site plumbing they need:

1. Use selfplay per-seat distribution as the null (not uniform)

Extend the QualityGateTracker to accept an optional selfplay_seat_wins: dict[int, int] field (passed from minimal_alphazero_loop.py after selfplay completes, same iter). If provided, compute:

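# For each seat, the game-natural expected win probability at equal skill
# (assumes every seat has at least one selfplay win, otherwise the division
# below is undefined):
selfplay_total = sum(selfplay_seat_wins.values())
expected_wr = {s: w / selfplay_total for s, w in selfplay_seat_wins.items()}

# Candidate's seat WR relative to that baseline:
normalized_wr = {s: seat_wr[s] / expected_wr[s] for s in seat_wr}

# Flag imbalance only if the normalized max/min ratio exceeds threshold.
# A seat with zero candidate wins makes the min zero, so guard the division:
min_normalized = min(normalized_wr.values())
normalized_ratio = (
    max(normalized_wr.values()) / min_normalized if min_normalized > 0 else float("inf")
)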

If selfplay_seat_wins is absent (first iter, old data), fall back to uniform as today but emit only an INFO note, not a warning.
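
As a worked check of the normalization, the sketch below applies it to the iter-4 numbers quoted in this issue. The function name and the choice to return None for undefined ratios are assumptions made for illustration; the real tracker API may look different.

```python
def normalized_seat_ratio(seat_wins, seat_games, selfplay_seat_wins):
    """Max/min ratio of candidate per-seat WR after dividing by the selfplay baseline.

    Returns None when the ratio is undefined (a seat has zero games, zero
    selfplay wins, or zero candidate wins); handling that is left to the caller.
    """
    selfplay_total = sum(selfplay_seat_wins.values())
    expected_wr = {s: w / selfplay_total for s, w in selfplay_seat_wins.items()}
    normalized = {}
    for s, games in seat_games.items():
        if games == 0 or expected_wr.get(s, 0.0) == 0.0:
            return None
        normalized[s] = (seat_wins[s] / games) / expected_wr[s]
    lowest = min(normalized.values())
    if lowest == 0.0:
        return None
    return max(normalized.values()) / lowest


# iter-4 evaluation and selfplay blocks from gh200-13.
eval_wins = {1: 2, 2: 3, 3: 3, 4: 3}
eval_games = {1: 13, 2: 13, 3: 12, 4: 12}
selfplay = {1: 17, 2: 25, 3: 28, 4: 30}

raw_ratio = (3 / 12) / (2 / 13)  # what the current check computes
norm_ratio = normalized_seat_ratio(eval_wins, eval_games, selfplay)
print(f"raw ratio {raw_ratio:.3f}, normalized ratio {norm_ratio:.3f}")
```

On this data the raw ratio of 1.625 drops to about 1.11 once each seat is divided by its selfplay share, well under the 1.5 threshold.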

2. Raise minimum-games threshold + use chi-square test

Replace the max/min ratio gate with a chi-square test against the selfplay null:

from scipy.stats import chisquare

observed = [seat_wins[s] for s in sorted(seat_wins)]
expected_p = [expected_wr[s] for s in sorted(seat_wins)]
expected = [p * sum(observed) for p in expected_p]
chi2, p_value = chisquare(observed, f_exp=expected)

# Warn only if p < 0.05 AND min_games_per_seat >= 25 (for meaningful power)
if p_value < 0.05 and min(seat_games.values()) >= 25:
    warnings.append(
        f"SEAT_WR_IMBALANCE: chi2 {chi2:.2f} vs selfplay null, "
        f"p={p_value:.3f}, per-seat WR {seat_wr}"
    )

For 4p this requires ~100 games per iter (25 per seat) before the check can fire, which means stage 2+ data only. It effectively silences the check at stage 1.
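
For readers without scipy at hand, the gate can also be sketched with the standard library by comparing the statistic to the df=3 critical value (7.815 at 5%) instead of computing a p-value. This is a minimal sketch, not the proposed implementation: the function name and return shape are hypothetical, it only covers 4 seats, and the second and third inputs below are made-up counts used purely to exercise the warning and fallback paths.

```python
CHI2_CRIT_DF3_5PCT = 7.815  # chi-square critical value for df=3, alpha=0.05


def seat_fairness_messages(seat_wins, seat_games, selfplay_seat_wins=None,
                           min_games_per_seat=25):
    """Return (level, message) pairs for a 4-seat game; empty list means no finding."""
    seats = sorted(seat_wins)
    if len(seats) != 4:
        raise ValueError("this sketch hard-codes the df=3 critical value")
    if selfplay_seat_wins:
        total_sp = sum(selfplay_seat_wins.values())
        expected_p = [selfplay_seat_wins[s] / total_sp for s in seats]
        null_name, level = "selfplay", "WARNING"
    else:
        # No selfplay data (first iter, old rows): uniform null, INFO only.
        expected_p = [1 / len(seats)] * len(seats)
        null_name, level = "uniform", "INFO"

    observed = [seat_wins[s] for s in seats]
    total = sum(observed)
    if min(seat_games.values()) < min_games_per_seat:
        return []  # too few games per seat for meaningful power
    if total == 0 or min(expected_p) == 0:
        return []  # chi-square is undefined with a zero expected count

    chi2 = sum((o - p * total) ** 2 / (p * total)
               for o, p in zip(observed, expected_p))
    if chi2 <= CHI2_CRIT_DF3_5PCT:
        return []
    return [(level, f"SEAT_WR_IMBALANCE: chi2 {chi2:.2f} vs {null_name} null")]


selfplay = {1: 17, 2: 25, 3: 28, 4: 30}
games_30 = {1: 30, 2: 30, 3: 30, 4: 30}
# Real iter-4 eval data: suppressed by the min-games guard (12-13 per seat).
stage1 = seat_fairness_messages({1: 2, 2: 3, 3: 3, 4: 3},
                                {1: 13, 2: 13, 3: 12, 4: 12}, selfplay)
# Hypothetical counts with a strong seat-1 deficit, for illustration only.
skewed = seat_fairness_messages({1: 0, 2: 15, 3: 16, 4: 17}, games_30, selfplay)
fallback = seat_fairness_messages({1: 0, 2: 15, 3: 16, 4: 17}, games_30)
print(stage1, skewed, fallback, sep="\n")
```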

3. Update callsites

  • minimal_alphazero_loop.py staged_evaluate flow: pass sp["p1_wins"]...p4_wins into the tracker so the check can use it.
  • minimal_alphazero_loop.py quality-gate call: collect selfplay_seat_wins from the selfplay result and pass to check_model_quality.
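
The plumbing could start from a small helper that turns the selfplay block into the dict[int, int] the tracker expects. The helper name below is hypothetical, and the sketch assumes the counters are always named p<N>_wins as in the metrics row above; how the result is handed to check_model_quality is not shown because its signature does not appear in this issue.

```python
import re
from typing import Dict, Optional

_SEAT_KEY = re.compile(r"^p(\d+)_wins$")


def selfplay_seat_wins_from_block(selfplay_block: dict) -> Optional[Dict[int, int]]:
    """Collect p<N>_wins counters into {seat: wins}; None if the block has none."""
    seat_wins = {}
    for key, value in selfplay_block.items():
        match = _SEAT_KEY.match(key)
        if match:
            seat_wins[int(match.group(1))] = int(value)
    # Sort by seat so downstream code can rely on a stable order.
    return dict(sorted(seat_wins.items())) or None


sp = {"completed": 100, "p1_wins": 17, "p2_wins": 25, "p3_wins": 28, "p4_wins": 30}
selfplay_seat_wins = selfplay_seat_wins_from_block(sp)
print(selfplay_seat_wins)
```

Returning None when no counters are present lets the check distinguish "no selfplay data" (uniform fallback, INFO only) from a real distribution.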

Baseline empirical data

ai-service/scripts/experiments/seat_fairness_baseline.py --board-type hex8 --num-players 4 --num-games 400 --seed 42 — 4 identical HeuristicAI players (no NN, no model training signal, no seat-specific behavior possible):

seat 1: 102/400 = 25.5%  (95% Wilson CI [0.215, 0.300])
seat 2: 102/400 = 25.5%  (95% CI [0.215, 0.300])
seat 3:  79/400 = 19.8%  (95% CI [0.161, 0.239])
seat 4: 117/400 = 29.2%  (95% CI [0.250, 0.339])

max/min ratio:   1.481  (right at the 1.5 threshold)
chi² vs uniform: 7.38, df=3  (critical@5%=7.82, critical@10%=6.25)
                 p ≈ 0.06 — borderline, not 5%-significant

Key observations:

  • At 400 games (8× the production stage-1 sample of 50), the ratio is 1.481, barely under 1.5. A stage-1 run at n=50/iter will cross 1.5 regularly by chance alone.
  • The game itself does not have uniform seat fairness. In identical-skill heuristic play, seat 3 is the weakest (19.8%) and seat 4 the strongest (29.2%). The selfplay block from iter 4 (a different strategy: a trained NN) shows a different asymmetry (p1=17%, p4=30%), suggesting that the "game-natural" per-seat distribution depends on the players' style. So the null must be the model's own selfplay distribution, not the heuristic baseline and not uniform.
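
How often the 1.5 ratio is crossed by chance alone can be estimated with a small Monte Carlo run. The sketch below makes a simplifying assumption that is not taken from the codebase: the candidate has the same true win probability in every seat (0.22, the overall iter-4 win rate), so any exceedance is pure sampling noise. The function name and trial counts are illustrative.

```python
import random


def ratio_exceed_rate(seat_games, win_prob, threshold=1.5, trials=20_000, seed=0):
    """Fraction of simulated evals whose max/min per-seat WR ratio exceeds threshold.

    Every seat shares the same true win probability, so exceedances are noise.
    A seat with zero wins is counted as exceeding (the ratio is unbounded).
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        rates = []
        for games in seat_games:
            wins = sum(rng.random() < win_prob for _ in range(games))
            rates.append(wins / games)
        if min(rates) == 0 or max(rates) / min(rates) > threshold:
            exceed += 1
    return exceed / trials


# Stage-1 layout from iter 4 (13/13/12/12 games per seat) versus a larger eval.
stage1_rate = ratio_exceed_rate([13, 13, 12, 12], win_prob=0.22)
larger_rate = ratio_exceed_rate([100, 100, 100, 100], win_prob=0.22, trials=5_000)
print(f"stage-1 (n=50): {stage1_rate:.1%}   n=400: {larger_rate:.1%}")
```

Comparing the two rates shows how much of the ratio's behaviour is driven by per-seat sample size rather than by any seat effect.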

Artifact saved at ai-service/data/seat_fairness_baseline/hex8_4p_400g_seed42.json.

Files

  • ai-service/scripts/lib/model_quality_gate.py:304-371 (_check_seat_fairness)
  • ai-service/scripts/minimal_alphazero_loop.py — quality-gate call site, needs plumbing for selfplay_seat_wins
  • ai-service/scripts/experiments/seat_fairness_baseline.py (new) — empirical baseline script

Blast radius

Before this fix, the gate would fire on every 3p/4p iter at stage 1 regardless of actual model behavior, producing noise in downstream tools (refresh_experiment_status.extract_seat_fairness, dashboards) and potentially motivating unnecessary B3 seat-stratified-loss work.

Related

  • Issue #78, "[A1] Add per-seat WR tracking to quality gate" (closed). This issue is a follow-up to make that check actually useful.
  • Planned issue B3 (seat-stratified value loss) — should NOT be implemented until this fix lands and we have real data showing persistent seat bias beyond game-natural asymmetry.

Effort

~3h including tests. Safe — pure diagnostic gate, never critical, only warnings.
