[C1 follow-up] Configure ensemble checkpoints for D9/D10 on production #87

Description

@an0mium

Follow-up to #77 / C1 (shipped in `cb8eca501`). The ensemble-voting
infrastructure is now live but disabled by default because no
alternate checkpoints are configured.

Task

Pick 2–3 diverse checkpoints per live config, verify they load cleanly
via the runtime, and set `RINGRIFT_ENSEMBLE_EXTRA_CHECKPOINTS` on the
`ringrift-ai` PM2 process.
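
Before touching the PM2 process, a pre-flight check on the env var value can catch typos and missing files. The sketch below assumes the value is a JSON object mapping each config name to a list of checkpoint paths relative to the `ai-service` directory, as in the example under "Deploy steps"; the function name `parse_extra_checkpoints` and the `base_dir` parameter are hypothetical and not part of the runtime.

```python
import json
from pathlib import Path


def parse_extra_checkpoints(raw: str, base_dir: Path) -> dict[str, list[Path]]:
    """Parse the env var value and resolve each path against base_dir.

    Assumed format (taken from the deploy example, not from the runtime):
    {"<config>": ["<relative path>", ...], ...}
    Raises ValueError on a malformed value or a missing file.
    """
    try:
        mapping = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"value is not valid JSON: {exc}") from exc
    if not isinstance(mapping, dict):
        raise ValueError("expected a JSON object keyed by config name")

    resolved: dict[str, list[Path]] = {}
    for config, paths in mapping.items():
        if not isinstance(paths, list) or not all(isinstance(p, str) for p in paths):
            raise ValueError(f"{config}: expected a list of path strings")
        checked = []
        for rel in paths:
            path = base_dir / rel
            if not path.is_file():
                raise ValueError(f"{config}: checkpoint not found: {path}")
            checked.append(path)
        resolved[config] = checked
    return resolved
```

This only confirms that the value parses and the files exist. Confirming that each checkpoint loads cleanly still has to go through the service's own loading path.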

Candidate checkpoint sources

For each config that has real training history, the strongest ensemble
typically comes from:

  1. The canonical model currently in production (already the primary
    — not listed in the extras env var).
  2. The previous canonical that was replaced by the last promotion
    (often still archived on S3 or locally).
  3. A diversity checkpoint — a mid-training snapshot from a
    different iteration, or a model trained with a different
    randomness/exploration setting.

For `hex8_2p` specifically:

  • Canonical: the 1979.8 Elo checkpoint currently in `models/canonical_hex8_2p.pth`
  • Candidate 2: earlier promotion at iter 21 (1967.6 Elo) if archived
  • Candidate 3: one of the v4 experiment promotions on gh200-8 (different architecture gives diversity even at lower Elo)

Deploy steps

  1. Copy selected `.pth` files to `/home/ubuntu/ringrift/ai-service/models/` on the EC2 host.
  2. Edit `~/ringrift/.env`:
    ```bash
    RINGRIFT_ENSEMBLE_ENABLED=true
    RINGRIFT_ENSEMBLE_EXTRA_CHECKPOINTS='{"hex8_2p": ["models/candidate_033.pth", "models/canonical_hex8_2p_v4.pth"]}'
    ```
  3. `set -a; source ~/ringrift/.env; set +a`
  4. `pm2 restart ringrift-ai --update-env`
  5. Smoke test: request a move in a D10 hex8_2p game and check that the `/ai/move` response carries the header `X-RingRift-Ensemble-Size: 3`. The service log should also include a line of the form `C1 ensemble vote: size=3 agreement=... failures=0 picks=[...]`.
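
The smoke check in the last step can be scripted once the response headers and a log line are in hand. This is a minimal sketch: the helper names are hypothetical, and the regex relies only on the `size=` and `failures=` fields shown in the log line above. The format of the other fields is not specified in this issue, so they are not parsed.

```python
import re
from typing import Mapping, Optional

# Matches only the fields quoted in this issue: size=<int> ... failures=<int>.
VOTE_RE = re.compile(r"C1 ensemble vote: size=(\d+)\b.*?\bfailures=(\d+)")


def header_ensemble_size(headers: Mapping[str, str]) -> Optional[int]:
    """Return the X-RingRift-Ensemble-Size value, or None if the header is absent."""
    for name, value in headers.items():
        if name.lower() == "x-ringrift-ensemble-size":  # header names are case-insensitive
            return int(value)
    return None


def vote_line_ok(line: str, expected_size: int) -> bool:
    """True if the log line reports the expected ensemble size and zero failures."""
    match = VOTE_RE.search(line)
    if match is None:
        return False
    return int(match.group(1)) == expected_size and int(match.group(2)) == 0
```

With two extra checkpoints configured, `header_ensemble_size(response_headers)` should return 3 for a D10 hex8_2p request, and `vote_line_ok(log_line, 3)` should be true for the matching log line.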

Rollback

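Remove both env var lines from `~/ringrift/.env`, then restart with `pm2 restart ringrift-ai --update-env`. If the variables are still exported in the shell you restart from, `unset` them first; otherwise `--update-env` will pass them through again. The single-model path is the default.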

Acceptance

  • D10 hex8_2p move requests show `ensemble_size` in the response body
  • `X-RingRift-Ensemble-Size` header present on those responses
  • Log line `C1 ensemble vote: ...` fires at least once per D10 request
  • No regression in single-model behaviour for D1-D8 tiers
  • p95 latency for ensemble-enabled requests stays within 20% of the single-model baseline (ensemble members run concurrently, so total wall-clock time should not scale with ensemble size)
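
The latency criterion can be checked from two lists of per-request timings, one collected before and one after enabling the ensemble. The sketch below uses the nearest-rank definition of p95, which is an assumption; the issue does not say how p95 is measured, and the function names are hypothetical.

```python
from typing import Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile: the smallest sample with >= 95% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), avoids float rounding
    return ordered[rank - 1]


def latency_within_budget(
    baseline_ms: Sequence[float],
    ensemble_ms: Sequence[float],
    tolerance: float = 0.20,
) -> bool:
    """True if the ensemble p95 is at most (1 + tolerance) times the baseline p95."""
    return p95(ensemble_ms) <= p95(baseline_ms) * (1.0 + tolerance)
```

Both samples should come from the same tier and board config (for example D10 hex8_2p) so the comparison is like for like.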

Optional: calibrate the Elo gain

After a few hundred D10 games, compare per-game ensemble agreement
(`ensemble_agreement`; low values mark hard positions) against game
outcomes. If agreement correlates with win rate, that is evidence the
ensemble disagrees mainly on the hard positions and is doing more than
adding latency.
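
One way to quantify that comparison is a Pearson correlation between a per-game agreement score and a binary outcome. This is a sketch under assumptions: how `ensemble_agreement` is aggregated per game (here, one number per game, for example the mean over moves) and the 1/0 win encoding are both choices made for illustration, not something this issue specifies.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical inputs: one agreement score and one outcome (1 = win, 0 = loss) per game.
# agreement = [0.91, 0.55, 0.78, ...]
# won = [1, 0, 1, ...]
# r = pearson(agreement, won)
```

A coefficient near zero over a few hundred games would suggest agreement carries little information about outcomes; the sample is small enough that any value should be read with caution.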

To include this in the rollout, run `scripts/ladder_calibration.py`
(not yet written; plan item B4) against the ensemble-enabled endpoint
for a D10-vs-D9 baseline comparison.

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
