Context
Commit 0e8afe0 fixed the P2P OrphanProcessDetectionLoop bug that caused it to SIGKILL legitimate systemd services whose cmdline contained tokens like --selfplay-budget. On 2026-04-17 this manifested as:
- ringrift-training.service on gh200-11 crash-looping (NRestarts=38)
- ringrift-training.service on gh200-13 crash-looping (NRestarts=66)
- ringrift-selfplay-worker.service on both nodes at NRestarts=362 (long-standing)
- ringrift-training.service on gh200-12 survived only because P2P was disabled seconds before the next kill cycle
Current mitigation: ringrift-p2p.service is stopped and disabled on gh200-11, gh200-12, gh200-13. The code fix is deployed to all three nodes (commit 0e8afe0 present on disk).
Why re-enable
Per CLAUDE.md: P2P is the cluster-wide control plane for sync, health, and job coordination. The minimal training loop itself does not need P2P, but fleet-level observability (heartbeats, Elo sync, health aggregation) depends on it.
Verification protocol — one node at a time
- Start with gh200-13 (hex8_4p, no promotions at risk).
- On that node: sudo systemctl enable --now ringrift-p2p
- Watch for SIGKILLs for 45 minutes (3x the prior 15-minute kill cadence).
- Pass criteria:
  - No status=9/KILL in journalctl for ringrift-training.service
  - NRestarts stays unchanged
  - P2P log emits: [OrphanDetection] Skipping pid X (protected unit ringrift-training.service, ...)
- If clean, roll to gh200-12 (square8_3p), then gh200-11 (v5-heavy).
- If a kill recurs, stop and diagnose before touching other nodes.
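The pass criteria above could be checked mechanically with a few small helpers. This is a minimal sketch, not an existing tool: the function names (count_kills, count_skips, parse_nrestarts, check_pass) are hypothetical, and the commented usage assumes the P2P log lines reach the journal under ringrift-p2p.service, which the issue does not confirm. If P2P logs to a file instead, pipe that file into count_skips.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the RingRift codebase.

# Count journal lines on stdin that record a SIGKILL exit (status=9/KILL).
count_kills() {
  grep -c 'status=9/KILL' || true   # grep exits 1 on zero matches but still prints 0
}

# Count OrphanDetection skip lines on stdin that name the protected training unit.
count_skips() {
  grep -F '[OrphanDetection] Skipping pid' \
    | grep -cF 'protected unit ringrift-training.service' || true
}

# Extract the number from "NRestarts=<n>", as printed by `systemctl show -p NRestarts`.
parse_nrestarts() {
  sed -n 's/^NRestarts=\([0-9][0-9]*\)$/\1/p'
}

# check_pass BASELINE CURRENT KILLS SKIPS: apply the three pass criteria in order.
check_pass() {
  [ "$3" -eq 0 ] || { echo "FAIL: $3 SIGKILL line(s) logged"; return 1; }
  [ "$2" -eq "$1" ] || { echo "FAIL: NRestarts changed $1 -> $2"; return 1; }
  [ "$4" -gt 0 ] || { echo "FAIL: no OrphanDetection skip line seen"; return 1; }
  echo "PASS"
}

# Intended use on the node (left commented out; needs systemd and sudo):
#   before=$(systemctl show -p NRestarts ringrift-training.service | parse_nrestarts)
#   sudo systemctl enable --now ringrift-p2p
#   sleep 2700   # 45 minutes
#   after=$(systemctl show -p NRestarts ringrift-training.service | parse_nrestarts)
#   kills=$(journalctl -u ringrift-training.service --since '45 min ago' --no-pager | count_kills)
#   skips=$(journalctl -u ringrift-p2p.service --since '45 min ago' --no-pager | count_skips)
#   check_pass "$before" "$after" "$kills" "$skips"
```

Capturing NRestarts before enabling P2P matters because the counters are already nonzero on these nodes; the criterion is that the value does not change, not that it is zero.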
Rollback
sudo systemctl disable --now ringrift-p2p
Blocking
None. The training fleet is stable without P2P. Do this in a dedicated window with someone actively watching the node.