Re-enable ringrift-p2p.service on gh200-11/12/13 after orphan-detector fix #88

Description

@an0mium

Context

Commit 0e8afe0 fixed the P2P OrphanProcessDetectionLoop bug that caused it to SIGKILL legitimate systemd services whose cmdline contained tokens like --selfplay-budget. On 2026-04-17 this manifested as:

  • ringrift-training.service on gh200-11 crash-looping (NRestarts=38)
  • ringrift-training.service on gh200-13 crash-looping (NRestarts=66)
  • ringrift-selfplay-worker.service on both nodes at NRestarts=362 (long-standing)
  • ringrift-training.service on gh200-12 surviving only because P2P was disabled seconds before the next kill cycle

Current mitigation: ringrift-p2p.service is stopped and disabled on gh200-11, gh200-12, gh200-13. The code fix is deployed to all three nodes (commit 0e8afe0 present on disk).
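
For illustration only, here is a minimal sketch of how a detector can tell a systemd-managed process from an orphan by reading its cgroup rather than matching cmdline tokens. This is not the code in 0e8afe0. The helper names and the contents of the protected list are assumptions; only the unit names come from this issue.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the systemd service unit that owns a process,
# given the contents of /proc/<pid>/cgroup on stdin. Prints nothing when the
# process is not inside a *.service cgroup.
unit_from_cgroup() {
    # The cgroup path is the third colon-separated field. Walk its
    # components from the end and print the first one ending in ".service".
    awk -F: '{
        n = split($3, parts, "/")
        for (i = n; i >= 1; i--) {
            if (parts[i] ~ /\.service$/) { print parts[i]; exit }
        }
    }'
}

# Hypothetical helper: succeed if the unit is on the protected list.
# The list below is an assumption built from the units named in this issue.
is_protected_unit() {
    case "$1" in
        ringrift-training.service|ringrift-selfplay-worker.service) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage on a node would look like `unit_from_cgroup < /proc/$pid/cgroup`, followed by `is_protected_unit` on the result before any kill decision.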

Why re-enable

Per CLAUDE.md: P2P is the cluster-wide control plane for sync, health, and job coordination. The minimal training loop itself does not need P2P, but fleet-level observability (heartbeats, Elo sync, health aggregation) depends on it.

Verification protocol — one node at a time

  1. Start with gh200-13 (hex8_4p, no promotions at risk).
  2. On that node: sudo systemctl enable --now ringrift-p2p
  3. Watch for SIGKILLs for 45 minutes (3x the prior 15-min kill cadence).
  4. Pass criteria:
    • No status=9/KILL in journalctl for ringrift-training.service
    • NRestarts stays unchanged
    • P2P log emits: [OrphanDetection] Skipping pid X (protected unit ringrift-training.service, ...)
  5. If clean, roll to gh200-12 (square8_3p), then gh200-11 (v5-heavy).
  6. If a kill recurs, stop and diagnose before touching other nodes.
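
The first two pass criteria can be checked mechanically. The sketch below is one way to do it, assuming it runs on the node under test with read access to the journal. The function names are hypothetical. It checks only ringrift-training.service, and it leaves the "Skipping pid" criterion to a helper because this issue does not say where the P2P log is written.

```shell
#!/usr/bin/env bash
# Sketch of the pass criteria. Unit names come from this issue;
# all function names are hypothetical.

TRAINING_UNIT="ringrift-training.service"

# Current restart counter for a unit, as reported by systemd.
nrestarts() {
    systemctl show "$1" --property=NRestarts --value
}

# Count SIGKILL exits in journal text read from stdin.
# grep -c exits non-zero on zero matches, so mask that exit status.
count_sigkills() {
    grep -c 'status=9/KILL' || true
}

# Count orphan-detector skip lines in P2P log text read from stdin.
count_skips() {
    grep -c -F '[OrphanDetection] Skipping pid' || true
}

# PASS only if the restart counter is unchanged and no SIGKILL was seen.
verdict() {
    local baseline=$1 current=$2 kills=$3
    if [ "$current" -eq "$baseline" ] && [ "$kills" -eq 0 ]; then
        echo PASS
    else
        echo FAIL
    fi
}

# Poll once a minute for the given number of minutes (default 45) and
# fail as soon as a kill or a restart shows up.
watch_node() {
    local minutes=${1:-45} start baseline kills i=0
    start=$(date '+%Y-%m-%d %H:%M:%S')
    baseline=$(nrestarts "$TRAINING_UNIT")
    while [ "$i" -lt "$minutes" ]; do
        sleep 60
        kills=$(journalctl -u "$TRAINING_UNIT" --since "$start" --no-pager | count_sigkills)
        if [ "$(verdict "$baseline" "$(nrestarts "$TRAINING_UNIT")" "$kills")" = FAIL ]; then
            echo FAIL
            return 1
        fi
        i=$((i + 1))
    done
    echo PASS
}
```

Run `watch_node 45` right after enabling P2P on the node. Failing fast matches step 6: a single recurrence is enough to stop the rollout.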

Rollback

sudo systemctl disable --now ringrift-p2p
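
To confirm the rollback took effect, something like the following may help. It is a sketch with hypothetical function names. It treats a unit that ended up in the "failed" state after the stop as rolled back too, which is an assumption about what counts as acceptable here.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed when the reported states match a rolled-back unit.
# $1 is the output of `systemctl is-enabled`, $2 the output of `systemctl is-active`.
rollback_ok() {
    [ "$1" = "disabled" ] && { [ "$2" = "inactive" ] || [ "$2" = "failed" ]; }
}

# Query systemd for the P2P unit's state and apply the check above.
check_rollback() {
    rollback_ok "$(systemctl is-enabled ringrift-p2p.service)" \
                "$(systemctl is-active ringrift-p2p.service)"
}
```

`check_rollback && echo "P2P is off"` can be run on each node after the disable command.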

Blocking

None. The training fleet is stable without P2P. Do this in a dedicated window, with someone actively watching the node under test.
