Fix A1 seat-fairness check: false positive on normal sampling noise (empirical baseline)

## Summary

The A1 seat-fairness check in `scripts/lib/model_quality_gate.py::_check_seat_fairness` fires false positives on normal sampling noise at stage-1 sample sizes and tests against an incorrect null hypothesis. The first production firing (gh200-13 iter 4 on hex8_4p) is statistically indistinguishable from noise: the same iter's selfplay block shows a **larger** seat asymmetry in a setting where no model bias is possible.

## Evidence: first production firing was a false positive

From gh200-13 `data/minimal_loop_hex8_4p/metrics.jsonl` iter 4:

```json
"evaluation": {
    "candidate_wins": 11, "best_wins": 39,
    "win_rate": 0.22, "games_played": 50,
    "decision": "reject", "decision_stage": 1
},
"quality_gate": {
    "warnings": [
        "SEAT_WR_IMBALANCE: max/min per-seat WR ratio 1.62 > 1.5
         (seat1=15% (2/13), seat2=23% (3/13), seat3=25% (3/12), seat4=25% (3/12))"
    ],
    "details": {
        "seat_fairness": {
            "seat_wr": {"1": 0.154, "2": 0.231, "3": 0.25, "4": 0.25},
            "seat_games": {"1": 13, "2": 13, "3": 12, "4": 12},
            "wr_ratio": 1.625
        }
    }
}
```

Statistical tests on this data:

- Chi² test of homogeneity across the 4 seats (wins vs. losses per seat): **χ² = 0.47, df=3, p ≈ 0.93**, not remotely significant
- seat1 vs seat4 two-proportion z-test: **z = 0.60** (need |z| > 1.96), not significant
- 95% Wilson CIs: seat1 [0.04, 0.42], seat4 [0.09, 0.53], overlapping almost entirely
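
These three figures can be reproduced from the per-seat counts with the standard library alone. The sketch below assumes the χ² figure is a Pearson test of homogeneity on the seats × (win, loss) table, which is the reading that reproduces 0.47. The helper names are illustrative and are not part of `model_quality_gate.py`.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def homogeneity_chi2(wins, games):
    """Pearson chi-square on a seats x (win, loss) table."""
    pooled = sum(wins) / sum(games)
    stat = 0.0
    for w, n in zip(wins, games):
        exp_win, exp_loss = n * pooled, n * (1 - pooled)
        stat += (w - exp_win) ** 2 / exp_win + ((n - w) - exp_loss) ** 2 / exp_loss
    return stat


def two_proportion_z(w1, n1, w2, n2):
    """Pooled two-proportion z statistic for seat 2 minus seat 1."""
    pooled = (w1 + w2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (w2 / n2 - w1 / n1) / se


def wilson_ci(w, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = w / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Per-seat counts from the iter-4 quality_gate block.
wins = [2, 3, 3, 3]
games = [13, 13, 12, 12]

chi2 = homogeneity_chi2(wins, games)
p_value = chi2_sf_df3(chi2)
z = two_proportion_z(wins[0], games[0], wins[3], games[3])

print(f"chi2={chi2:.2f} p={p_value:.2f} z={z:.2f}")
print("seat1 CI:", wilson_ci(wins[0], games[0]))
print("seat4 CI:", wilson_ci(wins[3], games[3]))
```

The closed-form tail in `chi2_sf_df3` only holds for df=3, so it covers the 4-player case and nothing else; `scipy.stats.chi2.sf` would be the general-purpose choice.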

## Evidence: the null hypothesis is wrong

The check compares candidate per-seat WR against **uniform** (expect 25% in each seat for 4p). But the SAME iter's selfplay block shows what the game itself does when all 4 seats are played by the **same model**:

```json
"selfplay": {
    "completed": 100,
    "p1_wins": 17, "p2_wins": 25, "p3_wins": 28, "p4_wins": 30
}
```

Ratio: **30/17 = 1.76×**, higher than the eval's 1.62×. All four "players" are identical, so no model seat-bias is possible: the asymmetry can only come from the game's natural turn-order effect plus sampling noise (a first-mover disadvantage is plausible, for example from being targeted or having less information when placing).

χ² for selfplay vs uniform: 3.92, df=3, p ≈ 0.27 — **also not significant**, but the max/min ratio trips 1.5 easily.
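
The contrast between the two criteria is easy to check by hand. The following sketch computes both on the selfplay counts; the variable and function names are illustrative, and the df=3 tail formula is specific to the 4-seat case.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def chi2_vs_uniform(wins):
    """Goodness-of-fit statistic against equal expected wins per seat."""
    expected = sum(wins) / len(wins)
    return sum((w - expected) ** 2 / expected for w in wins)


selfplay_wins = [17, 25, 28, 30]  # p1..p4 wins from the iter-4 selfplay block

ratio = max(selfplay_wins) / min(selfplay_wins)
chi2 = chi2_vs_uniform(selfplay_wins)
p_value = chi2_sf_df3(chi2)

ratio_gate_fires = ratio > 1.5      # the current criterion
chi2_significant = p_value < 0.05   # a conventional significance criterion

print(f"ratio={ratio:.2f} chi2={chi2:.2f} p={p_value:.2f}")
print(f"ratio gate fires: {ratio_gate_fires}, chi2 significant: {chi2_significant}")
```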

## Conclusion

The current check produces false positives because:

1. **Threshold too tight.** A 1.5× max/min ratio falls inside the spread that identical-skill 4-player hex8 play produces at n≤100, let alone at stage-1 n=50 (12-13 games per seat).
2. **Wrong null.** Comparing to uniform (1/N) ignores turn-order game effects. The proper null is the game's natural per-seat distribution under identical-skill play, which is already available **in the same metrics row** as the selfplay block.
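
How often a 1.5× gate trips on pure noise can be estimated with a short simulation. The sketch below rests on several assumptions: every seat gets the same true win probability (0.22, the overall eval win rate from iter 4), the seat game counts are those of iter 4, and a seat with zero wins is counted as exceeding the threshold, which may not match how `_check_seat_fairness` handles that case.

```python
import random


def ratio_exceeds(seat_games, win_prob, threshold, rng):
    """Simulate one evaluation and report whether max/min per-seat WR > threshold."""
    rates = []
    for n in seat_games:
        wins = sum(rng.random() < win_prob for _ in range(n))
        rates.append(wins / n)
    if min(rates) == 0:
        # Assumption: a zero-win seat counts as an imbalance unless every seat is zero.
        return max(rates) > 0
    return max(rates) / min(rates) > threshold


def false_positive_rate(seat_games, win_prob, threshold=1.5, trials=20000, seed=0):
    """Fraction of seat-neutral simulated evaluations on which the ratio gate fires."""
    rng = random.Random(seed)
    hits = sum(ratio_exceeds(seat_games, win_prob, threshold, rng) for _ in range(trials))
    return hits / trials


rate = false_positive_rate([13, 13, 12, 12], win_prob=0.22)
print(f"ratio gate fires on {rate:.0%} of seat-neutral simulated evaluations")
```

Because the simulated candidate has no seat preference at all, every firing counted here is a false positive by construction.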

## Proposed fix

Two changes to `_check_seat_fairness` in `scripts/lib/model_quality_gate.py`, plus the call-site plumbing they need:

### 1. Use selfplay per-seat distribution as the null (not uniform)

Extend the `QualityGateTracker` to accept an optional `selfplay_seat_wins: dict[int, int]` field (passed from `minimal_alphazero_loop.py` after selfplay completes, same iter). If provided, compute:

```python
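# For each seat, the game-natural expected share of wins at equal skill.
# Seats with no selfplay wins are skipped to avoid dividing by zero below.
selfplay_total = sum(selfplay_seat_wins.values())
expected_wr = {s: w / selfplay_total for s, w in selfplay_seat_wins.items() if w > 0}

# Candidate's seat WR relative to that baseline:
normalized_wr = {s: seat_wr[s] / expected_wr[s] for s in seat_wr if s in expected_wr}

# Flag imbalance only if the normalized max/min ratio exceeds threshold.
# A seat where the candidate never won gives a minimum of 0, so the ratio is infinite.
lowest = min(normalized_wr.values())
normalized_ratio = max(normalized_wr.values()) / lowest if lowest > 0 else float("inf")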
```

If `selfplay_seat_wins` is absent (first iter, old data), fall back to uniform as today but emit only an INFO note, not a warning.
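
One way the fallback could look is sketched below. `resolve_seat_null` and `report_imbalance` are hypothetical names, not existing functions in `model_quality_gate.py`, and treating a selfplay block with a zero-win seat as unusable is an assumption made here to keep the expected shares strictly positive.

```python
import logging

logger = logging.getLogger("quality_gate_sketch")


def resolve_seat_null(num_players, selfplay_seat_wins=None):
    """Return (expected_wr, source), where source is "selfplay" or "uniform"."""
    seats = range(1, num_players + 1)
    if selfplay_seat_wins:
        counts = [selfplay_seat_wins.get(s, 0) for s in seats]
        # Assumption: a block with a zero-win seat is too thin to serve as a null.
        if all(c > 0 for c in counts):
            total = sum(counts)
            return {s: selfplay_seat_wins[s] / total for s in seats}, "selfplay"
    return {s: 1.0 / num_players for s in seats}, "uniform"


def report_imbalance(message, source, warnings):
    """Append a warning under the selfplay null; log an INFO note under the fallback."""
    if source == "selfplay":
        warnings.append(message)
    else:
        logger.info("%s (uniform fallback, informational only)", message)


expected_wr, source = resolve_seat_null(4, {1: 17, 2: 25, 3: 28, 4: 30})
fallback_wr, fallback_source = resolve_seat_null(4)
```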

### 2. Raise minimum-games threshold + use chi-square test

Replace the max/min ratio gate with a chi-square test against the selfplay null:

```python
from scipy.stats import chisquare

observed = [seat_wins[s] for s in sorted(seat_wins)]
expected_p = [expected_wr[s] for s in sorted(seat_wins)]
expected = [p * sum(observed) for p in expected_p]
chi2, p_value = chisquare(observed, f_exp=expected)

# Warn only if p < 0.05 AND min_games_per_seat >= 25 (for meaningful power)
if p_value < 0.05 and min(seat_games.values()) >= 25:
    warnings.append(
        f"SEAT_WR_IMBALANCE: chi2 {chi2:.2f} vs selfplay null, "
        f"p={p_value:.3f}, per-seat WR {seat_wr}"
    )
```

For 4p this requires ~100 games per iter (25 per seat) before the check can fire, which means stage 2+ data only. At stage 1 the check is effectively silenced.
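
As a sanity check, the proposed gate can be replayed on the iter-4 counts quoted earlier. The sketch below swaps `scipy.stats.chisquare` for a closed-form df=3 tail so it runs on the standard library, which restricts it to 4 seats. `seat_imbalance_warning` is a hypothetical name, and the per-seat win counts are read off the warning string in the metrics row.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def seat_imbalance_warning(seat_wins, seat_games, expected_wr, alpha=0.05, min_games=25):
    """Return a warning string, or None when the gate stays silent (4 seats only)."""
    if len(seat_wins) != 4:
        raise ValueError("this sketch hard-codes df=3, i.e. exactly 4 seats")
    total = sum(seat_wins.values())
    if total == 0 or min(seat_games.values()) < min_games:
        return None
    chi2 = sum(
        (seat_wins[s] - expected_wr[s] * total) ** 2 / (expected_wr[s] * total)
        for s in seat_wins
    )
    p_value = chi2_sf_df3(chi2)
    if p_value < alpha:
        return f"SEAT_WR_IMBALANCE: chi2 {chi2:.2f} vs selfplay null, p={p_value:.3f}"
    return None


# iter-4 eval counts and the selfplay-derived null from the same metrics row.
seat_wins = {1: 2, 2: 3, 3: 3, 4: 3}
seat_games = {1: 13, 2: 13, 3: 12, 4: 12}
expected_wr = {1: 0.17, 2: 0.25, 3: 0.28, 4: 0.30}

with_floor = seat_imbalance_warning(seat_wins, seat_games, expected_wr)
without_floor = seat_imbalance_warning(seat_wins, seat_games, expected_wr, min_games=0)
print(with_floor, without_floor)
```

With these inputs the gate stays silent both with the 25-game floor and with the floor disabled, so the floor is not the only thing suppressing the iter-4 warning.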

### 3. Update callsites

- `minimal_alphazero_loop.py` `staged_evaluate` flow: pass `sp["p1_wins"]` through `sp["p4_wins"]` into the tracker so the check can use them.
- `minimal_alphazero_loop.py` quality-gate call: collect `selfplay_seat_wins` from the selfplay result and pass to `check_model_quality`.
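
The plumbing could be as small as one conversion helper. The sketch below is an assumption about shape only: `selfplay_seat_wins_from_block` is a hypothetical name, and the `p{N}_wins` key pattern is taken from the selfplay block shown earlier.

```python
def selfplay_seat_wins_from_block(sp, num_players):
    """Map {"p1_wins": 17, ...} to {1: 17, ...}; return None if any seat key is missing."""
    wins = {}
    for seat in range(1, num_players + 1):
        key = f"p{seat}_wins"
        if key not in sp:
            # An incomplete block is treated as absent so the gate can fall back.
            return None
        wins[seat] = int(sp[key])
    return wins


sp = {"completed": 100, "p1_wins": 17, "p2_wins": 25, "p3_wins": 28, "p4_wins": 30}
selfplay_seat_wins = selfplay_seat_wins_from_block(sp, num_players=4)
print(selfplay_seat_wins)
```

Returning `None` for an incomplete block keeps old metrics rows and first iterations on the uniform-fallback path instead of raising.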

## Baseline empirical data

`ai-service/scripts/experiments/seat_fairness_baseline.py --board-type hex8 --num-players 4 --num-games 400 --seed 42` — 4 identical HeuristicAI players (no NN, no model training signal, no seat-specific behavior possible):

```
seat 1: 102/400 = 25.5%  (95% Wilson CI [0.215, 0.300])
seat 2: 102/400 = 25.5%  (95% CI [0.215, 0.300])
seat 3:  79/400 = 19.8%  (95% CI [0.161, 0.239])
seat 4: 117/400 = 29.2%  (95% CI [0.250, 0.339])

max/min ratio:   1.481  (right at the 1.5 threshold)
chi² vs uniform: 7.38, df=3  (critical@5%=7.82, critical@10%=6.25)
                 p ≈ 0.06 — borderline, not 5%-significant
```

Key observations:

- **At 400 games (8× the production stage-1 sample of 50), the ratio is 1.481, barely under 1.5.** A stage-1 n=50/iter run will cross 1.5 regularly by chance alone.
- **The game itself does not appear to have uniform seat fairness.** Seat 3 is the weakest (19.8%) and seat 4 the strongest (29.2%) in identical-skill heuristic play, although at p ≈ 0.06 this is borderline. The selfplay block from iter 4 (a different strategy: the trained NN) shows a different asymmetry (p1=17%, p4=30%), which suggests that the "game-natural" per-seat distribution **depends on the player's style**. So the null should be the model's own selfplay distribution, not the heuristic's and not uniform.

Artifact saved at `ai-service/data/seat_fairness_baseline/hex8_4p_400g_seed42.json`.

## Files

- `ai-service/scripts/lib/model_quality_gate.py:304-371` — `_check_seat_fairness`
- `ai-service/scripts/minimal_alphazero_loop.py` — quality-gate call site, needs plumbing for selfplay_seat_wins
- `ai-service/scripts/experiments/seat_fairness_baseline.py` (new) — empirical baseline script

## Blast radius

Without this fix, the gate fires routinely on 3p/4p iters at stage 1 regardless of actual model behavior, producing noise in downstream tools (`refresh_experiment_status.extract_seat_fairness`, dashboards) and potentially motivating unnecessary B3 seat-stratified-loss work.

## Related

- Issue #78 (A1 per-seat WR tracking) — closed, this is a follow-up to make it actually useful.
- Planned issue B3 (seat-stratified value loss) — should NOT be implemented until this fix lands and we have real data showing persistent seat bias beyond game-natural asymmetry.

## Effort

~3h including tests. Safe — pure diagnostic gate, never critical, only warnings.

