Follow-up to #77 / C1 (shipped in `cb8eca501`). The ensemble-voting
infrastructure is now live but disabled by default because no
alternate checkpoints are configured.
Task
Pick 2–3 diverse checkpoints per live config, verify they load cleanly
via the runtime, and set `RINGRIFT_ENSEMBLE_EXTRA_CHECKPOINTS` on the
`ringrift-ai` PM2 process.
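Before touching the PM2 process, a pre-flight check on the env var value can catch a malformed entry early. The sketch below is illustrative only: `check_extra_checkpoints` is a hypothetical helper, not part of the runtime. It assumes the value is a JSON object mapping a config name to a list of checkpoint paths, and that relative paths resolve against the ai-service directory. It checks JSON shape and file existence only; it does not load weights, so it complements loading through the runtime rather than replacing it.

```python
import json
from pathlib import Path


def check_extra_checkpoints(raw: str, base_dir: str) -> list[str]:
    """Return a list of problems found in the env var value (empty list = OK).

    Hypothetical helper. Assumes `raw` is a JSON object of the form
    {"<config>": ["<path>", ...]} and that relative paths are resolved
    against `base_dir`.
    """
    try:
        mapping = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(mapping, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for config, paths in mapping.items():
        # Each config must map to a list of path strings.
        if not isinstance(paths, list) or not all(isinstance(p, str) for p in paths):
            problems.append(f"{config}: expected a list of path strings")
            continue
        for rel in paths:
            if not (Path(base_dir) / rel).is_file():
                problems.append(f"{config}: missing file {rel}")
    return problems
```

On the host this could be called with `os.environ["RINGRIFT_ENSEMBLE_EXTRA_CHECKPOINTS"]` and the ai-service directory as `base_dir`; an empty result means every listed file was found.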
Candidate checkpoint sources
For each config that has real training history, the strongest ensemble
typically comes from:
- The canonical model currently in production (already the primary,
so it is not listed in the extras env var).
- The previous canonical that was replaced by the last promotion
(often still archived on S3 or locally).
- A diversity checkpoint — a mid-training snapshot from a
different iteration, or a model trained with a different
randomness/exploration setting.
For `hex8_2p` specifically:
- Canonical: the 1979.8 Elo checkpoint currently in `models/canonical_hex8_2p.pth`
- Candidate 2: earlier promotion at iter 21 (1967.6 Elo) if archived
- Candidate 3: one of the v4 experiment promotions on gh200-8 (different architecture gives diversity even at lower Elo)
Deploy steps
- Copy selected `.pth` files to `/home/ubuntu/ringrift/ai-service/models/` on the EC2 host.
- Edit `~/ringrift/.env`:
```bash
RINGRIFT_ENSEMBLE_ENABLED=true
RINGRIFT_ENSEMBLE_EXTRA_CHECKPOINTS='{"hex8_2p": ["models/candidate_033.pth", "models/canonical_hex8_2p_v4.pth"]}'
```
- `set -a; source ~/ringrift/.env; set +a`
- `pm2 restart ringrift-ai --update-env`
- Smoke: play a D10 hex8_2p game and check that the `/ai/move` response includes the header `X-RingRift-Ensemble-Size: 3` (the primary plus the two extras configured above). The log line will include `C1 ensemble vote: size=3 agreement=... failures=0 picks=[...]`.
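The log half of the smoke check can be scripted. The sketch below is a minimal example with hypothetical helper names (`parse_vote_line`, `smoke_ok`). It reads only the `size=` and `failures=` fields from the `C1 ensemble vote` line and makes no assumption about the format of `agreement` or `picks`, which this issue does not spell out.

```python
import re


def parse_vote_line(line: str):
    """Extract (size, failures) from a 'C1 ensemble vote' log line, or None."""
    if "C1 ensemble vote:" not in line:
        return None
    size = re.search(r"\bsize=(\d+)", line)
    failures = re.search(r"\bfailures=(\d+)", line)
    if size is None or failures is None:
        return None
    return int(size.group(1)), int(failures.group(1))


def smoke_ok(line: str, expected_size: int = 3) -> bool:
    """True when the vote used the expected number of models with no failures."""
    return parse_vote_line(line) == (expected_size, 0)
```

With one primary and two extras, `expected_size` stays at 3. A config with a different number of extras would need a different value.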
Rollback
Remove the two env var lines from `~/ringrift/.env` and restart `ringrift-ai`. The single-model path is the default.
Acceptance
Optional: calibrate the Elo gain
After a few hundred D10 games, compare per-game eval agreement
(low `ensemble_agreement` = hard positions) against outcomes. If
agreement correlates with win rate, that is evidence the ensemble is
actually disagreeing on the hard positions, not just adding latency.
If you want this as part of the rollout, run `scripts/ladder_calibration.py`
(not yet written — plan item B4) against the ensemble-enabled endpoint
for a D10-vs-D9 baseline comparison.
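Until B4 exists, the agreement-versus-outcome comparison can be approximated offline. The sketch below is illustrative only: it assumes each game has already been reduced to a pair of (mean `ensemble_agreement` for the game, whether the ensemble side won). How those pairs are exported is not covered here, and the function names are hypothetical. It computes a plain Pearson correlation, which is the point-biserial correlation when the outcome is encoded as 0/1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(xs)
    if n != len(ys) or n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # no variation in one of the series
    return sxy / math.sqrt(sxx * syy)


def agreement_vs_outcome(games):
    """games: iterable of (mean_agreement, won) pairs, with won as True/False."""
    games = list(games)
    agreement = [a for a, _ in games]
    outcome = [1.0 if won else 0.0 for _, won in games]
    return pearson(agreement, outcome)
```

A clearly positive value would match the reading above, with low-agreement games tending to be the ones that are lost. A value near zero would suggest that agreement says little about outcomes at this sample size.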