[C1 follow-up] Configure ensemble checkpoints for D9/D10 on production #87

Follow-up to #77 / C1 (shipped in \`cb8eca501\`).  The ensemble-voting
infrastructure is now live but disabled by default because no
alternate checkpoints are configured.

## Task

Pick 2–3 diverse checkpoints per live config, verify they load cleanly
via the runtime, and set \`RINGRIFT_ENSEMBLE_EXTRA_CHECKPOINTS\` on the
\`ringrift-ai\` PM2 process.

## Candidate checkpoint sources

For each config that has real training history, the strongest ensemble
typically comes from:

1. The **canonical** model currently in production (already the primary
   — not listed in the extras env var).
2. The **previous canonical** that was replaced by the last promotion
   (often still archived on S3 or locally).
3. A **diversity checkpoint** — a mid-training snapshot from a
   different iteration, or a model trained with a different
   randomness/exploration setting.

For \`hex8_2p\` specifically:

- Canonical: the 1979.8 Elo checkpoint currently in \`models/canonical_hex8_2p.pth\`
- Candidate 2: earlier promotion at iter 21 (1967.6 Elo) if archived
- Candidate 3: one of the v4 experiment promotions on gh200-8 (different architecture gives diversity even at lower Elo)
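
Before listing candidates in the env var, a quick preflight can catch missing or truncated copies. The sketch below is hypothetical (the function name `check_checkpoints` is not part of the repo) and only confirms that each file exists and is non-empty. It does not prove the checkpoint loads; that still has to be verified through the runtime.

```bash
# Hypothetical preflight helper: report whether each candidate checkpoint
# exists and is non-empty. This does NOT prove the file loads in the runtime.
check_checkpoints() {
  rc=0
  for ckpt in "$@"; do
    if [ -s "$ckpt" ]; then
      printf 'ok       %s\n' "$ckpt"
    else
      printf 'missing  %s\n' "$ckpt" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example (assumes relative paths resolve against the ai-service directory):
#   cd /home/ubuntu/ringrift/ai-service
#   check_checkpoints models/candidate_033.pth models/canonical_hex8_2p_v4.pth
```

The function returns non-zero if any file is missing or empty, so it can gate the later deploy steps in a script.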

## Deploy steps

1. Copy selected \`.pth\` files to \`/home/ubuntu/ringrift/ai-service/models/\` on the EC2 host.
2. Edit \`~/ringrift/.env\`:
   \`\`\`bash
   RINGRIFT_ENSEMBLE_ENABLED=true
   RINGRIFT_ENSEMBLE_EXTRA_CHECKPOINTS='{"hex8_2p": ["models/candidate_033.pth", "models/canonical_hex8_2p_v4.pth"]}'
   \`\`\`
3. \`set -a; source ~/ringrift/.env; set +a\`
4. \`pm2 restart ringrift-ai --update-env\`
5. Smoke test: request a move in a D10 hex8_2p game and check that the \`/ai/move\` response carries the header \`X-RingRift-Ensemble-Size: 3\` (the primary plus the two extras above). The service log should also include a line like \`C1 ensemble vote: size=3 agreement=... failures=0 picks=[...]\`.
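
The header check in the smoke step can be scripted. This is a minimal sketch: the helper name `ensemble_size_from_headers`, the `AI_URL` variable, and the request file `d10_hex8_2p_request.json` are all hypothetical, and it assumes `/ai/move` accepts a POST with a JSON body. Adjust to the real request shape.

```bash
# Read HTTP response headers on stdin and print the value of
# X-RingRift-Ensemble-Size (header names are case-insensitive).
# Prints nothing if the header is absent.
ensemble_size_from_headers() {
  tr -d '\r' | awk -F': *' '
    !seen && tolower($1) == "x-ringrift-ensemble-size" { print $2; seen = 1 }
  '
}

# Example (AI_URL and the request file are placeholders, not real values):
#   curl -s -o /dev/null -D - -X POST "$AI_URL/ai/move" \
#     -H 'Content-Type: application/json' \
#     -d @d10_hex8_2p_request.json | ensemble_size_from_headers
#
# Recent log lines can be checked with:
#   pm2 logs ringrift-ai --lines 200 --nostream | grep 'C1 ensemble vote'
```

With the two extras from step 2 configured, the helper should print `3`. An empty result suggests the request took the single-model path.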

## Rollback

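Remove the two \`RINGRIFT_ENSEMBLE_*\` lines from \`~/ringrift/.env\`, unset both variables in any shell that already sourced the file, and restart \`ringrift-ai\`. The single-model path is the default.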
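
A rollback sketch, under the assumption that the `.env` file uses plain `KEY=value` lines (optionally prefixed with `export`). The helper name `strip_ensemble_vars` is hypothetical.

```bash
# Print the given env file with the two ensemble variables removed.
strip_ensemble_vars() {
  sed -E '/^(export +)?RINGRIFT_ENSEMBLE_(ENABLED|EXTRA_CHECKPOINTS)=/d' "$1"
}

# Example rollback sequence:
#   strip_ensemble_vars ~/ringrift/.env > ~/ringrift/.env.new \
#     && mv ~/ringrift/.env.new ~/ringrift/.env
#   unset RINGRIFT_ENSEMBLE_ENABLED RINGRIFT_ENSEMBLE_EXTRA_CHECKPOINTS
#   set -a; . ~/ringrift/.env; set +a
#   pm2 restart ringrift-ai --update-env
```

Sourcing the edited file does not clear variables that are already exported in the current shell, hence the explicit `unset`. It is also worth confirming with `pm2 env <id>` that the restarted process no longer carries the two variables, since PM2 may keep values it was given earlier.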

## Acceptance

- [ ] D10 hex8_2p move requests show \`ensemble_size\` in the response body
- [ ] \`X-RingRift-Ensemble-Size\` header present on those responses
- [ ] Log line \`C1 ensemble vote: ...\` fires at least once per D10 request
- [ ] No regression in single-model behaviour for D1-D8 tiers
- [ ] No regression in p95 latency (ensemble runs concurrently; total wall-clock should be within 20% of single-model baseline)
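
The latency criterion can be checked mechanically once request latencies have been collected for both configurations. This sketch assumes one latency value per line (for example, milliseconds) in two hypothetical files. How the samples are collected is not covered here, and the helper names are illustrative.

```bash
# Nearest-rank p95 of numbers read from stdin (one per line).
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      r = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR) in integer arithmetic
      print v[r]
    }'
}

# Succeeds when the ensemble p95 is at most 1.2x the baseline p95.
within_20_percent() {
  awk -v base="$1" -v ens="$2" 'BEGIN { exit !(ens <= base * 1.2) }'
}

# Example (file names are placeholders):
#   base=$(p95 < single_model_latencies.txt)
#   ens=$(p95 < ensemble_latencies.txt)
#   within_20_percent "$base" "$ens" && echo "p95 within budget"
```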

## Optional: calibrate the Elo gain

After a few hundred D10 games, compare per-game ensemble agreement
(low \`ensemble_agreement\` = hard positions) against game outcomes. If
agreement correlates with win rate, that is evidence the ensemble is
actually disagreeing on the hard positions, not just adding latency.
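
One simple way to look for that correlation is to split games into low-agreement and high-agreement buckets and compare win rates. The sketch below assumes a hypothetical CSV export with one `mean_agreement,won` row per game (no header, `won` is 1 or 0) and an agreement scale of 0 to 1. The default threshold of 0.8 is arbitrary, and none of these names come from the service itself.

```bash
# Input on stdin: CSV rows "mean_agreement,won" (won is 1 or 0), one per game.
# Output: game count and win rate below and at-or-above the threshold ($1).
win_rate_by_agreement() {
  awk -F, -v t="${1:-0.8}" '
    $1 + 0 < t { n_lo++; w_lo += $2; next }
               { n_hi++; w_hi += $2 }
    END {
      if (n_lo) printf "low  n=%d win_rate=%.2f\n", n_lo, w_lo / n_lo
      if (n_hi) printf "high n=%d win_rate=%.2f\n", n_hi, w_hi / n_hi
    }'
}

# Example (file name is a placeholder):
#   win_rate_by_agreement 0.8 < d10_games.csv
```

A clear gap between the two buckets would support the claim above. Similar win rates in both would suggest agreement carries little signal about position difficulty.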

To make this part of the rollout, run \`scripts/ladder_calibration.py\`
(not yet written; plan item B4) against the ensemble-enabled endpoint
for a D10-vs-D9 baseline comparison.
