Recover from opencode stalls and control-plane timeouts#66
Conversation
📝 WalkthroughWalkthroughAdds end-to-end resilience for ChangesResume Server-Death Resilience
Estimated code review effort: 5 (Critical) | ~120 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Coverage Report
Generated by pytest-cov on |
There was a problem hiding this comment.
Pull request overview
This PR improves CodeCome’s resilience to opencode server crashes, control-plane timeouts, and hung busy turns by introducing structured run statuses, better SSE liveness handling, and automated server restart/retry flows (including resuming Phase 1 from the failed subphase only).
Changes:
- Adds
RunStatuscodes and propagatesserver_unreachable/session_stalledas first-class recoverable conditions throughout the runner/harness/Phase 1 orchestration. - Hardens SSE consumption so long busy turns aren’t abandoned (reconnect past budget while busy+alive) while still bounding genuine hangs (stall watchdog + read-tick inactivity).
- Improves operational diagnostics and workspace hygiene (server exit-code logging; Erlang crash dump redirected into ignored
tmp/; documents new env knobs).
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/phases/completion.py | Adds resume opener messaging for server_unreachable and session_stalled. |
| tools/opencode/serve.py | Adds process-handle liveness checks, restart(), and a daemon exit-code log monitor. |
| tools/events/sse_client.py | Adds read-tick inactivity handling and should_continue to avoid abandoning busy turns. |
| tools/events/phase_loop.py | Adds busy/idle tracking, stall watchdog, status-probe idle recovery, and stall propagation in RunResult. |
| tools/events/base.py | Adds best-effort /session/status probe helper (absence ⇒ idle). |
| tools/codecome/status.py | Introduces shared RunStatus IntEnum and normalization helper. |
| tools/codecome/session.py | Adds structured OpenCodeRequestError, retries for session creation, and health probing. |
| tools/codecome/runner.py | Threads liveness checks through resume wait + event consumption; maps retriable HTTP failures to recoverable outcomes. |
| tools/codecome/phase_1.py | Returns structured Phase 1 outcome and enables retry-from-failed-subphase after server restart. |
| tools/codecome/harness.py | Centralizes restart+retry logic for recoverable conditions; adds restart budgets for Phase 1 and phases 2–6. |
| tests/test_sse_busy_resilience.py | Adds coverage for reconnect budgeting, read-tick behavior, stall watchdog, and idle recovery sync. |
| tests/test_session.py | Adds coverage for create-session retry behavior and retriable request errors. |
| tests/test_sandbox_bootstrap.py | Verifies Erlang crash dump is redirected to ignored tmp/. |
| tests/test_runner_resume_health.py | Tests resume readiness logic with health/liveness probes and stall propagation mapping. |
| tests/test_harness_recovery_restart.py | Tests harness restart/retry behavior and shared restart budget exhaustion. |
| tests/test_codecome_runner.py | Updates runner tests for recoverable prompt/session failures using OpenCodeRequestError. |
| templates/sandboxes/erlang-otp/docker-compose.yml | Sets ERL_CRASH_DUMP to /workspace/tmp/erl_crash.dump. |
| README.md | Documents new resilience/recovery environment variables and behavior. |
| .project/resume-server-death-resilience-plan.md | Adds implementation plan/design notes for the resilience work. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Greptile SummaryThis PR adds structured resilience and recovery for the opencode server/session lifecycle, addressing a series of real production failures (dead server, stalled model turn, silent-but-open SSE stream, and control-plane timeouts). The implementation is carefully layered across multiple fixes with detailed design documentation.
Confidence Score: 5/5The changes are well-contained resilience improvements with no regression risk to the happy path; every new branch is covered by unit tests. The new stall watchdog, server-death disambiguation, and Phase 1 restart-at-subphase logic are all thoroughly unit-tested. Exception hierarchy ordering (subclass before parent) is correct, shared state (counters, step_finish_count) is properly initialised, and the liveness-check closure correctly captures the runner by reference so post-restart process handles are picked up automatically. The only findings are two cosmetic style nits. No files require special attention; all core orchestration paths are covered by tests. Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant H as harness.run_phase_mode
participant P1 as run_phase_1
participant SP as _run_subphase
participant R as _run_single_attempt
participant PEL as PhaseEventLoop
participant SSE as SseClient
participant SRV as opencode serve
H->>P1: "run_phase_1(start_at="1a")"
P1->>SP: _run_subphase("1a")
SP->>R: _run_single_attempt(liveness_check)
R->>SRV: create_session / send_prompt
SRV-->>SSE: SSE /event stream
SSE-->>PEL: events() [with should_continue hook]
Note over PEL,SSE: Silent stream → read-tick (None) every 10s
PEL->>PEL: _should_keep_consuming()
alt Session stalled (180s no progress)
PEL->>SRV: GET /session/status (probe)
alt "Status = idle"
PEL->>PEL: "_session_idle_via_status=True"
PEL-->>R: "RunResult(stalled=False)"
else "Status = busy / None"
PEL->>PEL: "_session_stalled=True"
PEL-->>R: "RunResult(session_stalled=True)"
R-->>SP: (OK, ses, stalled_result)
SP-->>P1: RunStatus.SESSION_STALLED, "1b"
P1-->>H: Phase1Outcome(SESSION_STALLED, "1b")
H->>SRV: runner.restart()
H->>P1: "run_phase_1(start_at="1b")"
end
else Process exited (liveness_check → False)
PEL->>PEL: "_server_unreachable=True"
PEL-->>R: "RunResult(finish="server_unreachable")"
R-->>SP: (INCOMPLETE, ses, unreachable)
SP-->>P1: RunStatus.SERVER_UNREACHABLE, "1b"
P1-->>H: Phase1Outcome(SERVER_UNREACHABLE, "1b")
H->>SRV: runner.restart()
H->>P1: "run_phase_1(start_at="1b")"
end
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant H as harness.run_phase_mode
participant P1 as run_phase_1
participant SP as _run_subphase
participant R as _run_single_attempt
participant PEL as PhaseEventLoop
participant SSE as SseClient
participant SRV as opencode serve
H->>P1: "run_phase_1(start_at="1a")"
P1->>SP: _run_subphase("1a")
SP->>R: _run_single_attempt(liveness_check)
R->>SRV: create_session / send_prompt
SRV-->>SSE: SSE /event stream
SSE-->>PEL: events() [with should_continue hook]
Note over PEL,SSE: Silent stream → read-tick (None) every 10s
PEL->>PEL: _should_keep_consuming()
alt Session stalled (180s no progress)
PEL->>SRV: GET /session/status (probe)
alt "Status = idle"
PEL->>PEL: "_session_idle_via_status=True"
PEL-->>R: "RunResult(stalled=False)"
else "Status = busy / None"
PEL->>PEL: "_session_stalled=True"
PEL-->>R: "RunResult(session_stalled=True)"
R-->>SP: (OK, ses, stalled_result)
SP-->>P1: RunStatus.SESSION_STALLED, "1b"
P1-->>H: Phase1Outcome(SESSION_STALLED, "1b")
H->>SRV: runner.restart()
H->>P1: "run_phase_1(start_at="1b")"
end
else Process exited (liveness_check → False)
PEL->>PEL: "_server_unreachable=True"
PEL-->>R: "RunResult(finish="server_unreachable")"
R-->>SP: (INCOMPLETE, ses, unreachable)
SP-->>P1: RunStatus.SERVER_UNREACHABLE, "1b"
P1-->>H: Phase1Outcome(SERVER_UNREACHABLE, "1b")
H->>SRV: runner.restart()
H->>P1: "run_phase_1(start_at="1b")"
end
Reviews (2): Last reviewed commit: "fix: address resilience review feedback" | Re-trigger Greptile |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/codecome/harness.py`:
- Around line 328-362: The stalled-recovery path in the harness can break out
with a success status when the restart budget is exhausted, because
`_recovery_reason` from `session_stalled` does not force a non-success result.
Update the recovery handling in `harness.py` so that the `session_stalled`
branch sets `returncode`/final status to a failure state before the
budget-exhaustion `break`, and ensure the footer/exit logic driven by
`finish_warning`, `last_finish_reason`, and `RunStatus` cannot report success in
this case. Add a regression around the restart loop in the main run path that
exhausts the budget using only `session_stalled` outcomes and verifies a
non-zero exit.
In `@tools/codecome/runner.py`:
- Around line 297-308: Returning RunStatus.INCOMPLETE from the
OpenCodeRequestError retriable path can happen while the SSE consumer thread is
still alive, which lets the abandoned session keep running as the harness
retries. Update the retry/error handling in runner.py around the
consumer.join(timeout=5.0) and the OpenCodeRequestError branch so that a
timed-out or still-active consumer is first stopped/drained (or treated as a
terminal failure for this attempt) before any recoverable result is returned.
Use the existing _record_codecome_event and RunStatus/RunResult flow, but ensure
this path does not hand back a fresh retryable state until the consumer has
fully exited.
In `@tools/codecome/session.py`:
- Around line 225-228: Close the HTTP response in create_session() when calling
urllib.request.urlopen so retries don’t leak sockets. Update the request
handling in session.py to use the same context-manager pattern already used by
the status/health helpers, ensuring the response object is always closed after
reading and parsing the JSON.
In `@tools/events/phase_loop.py`:
- Around line 206-207: The dead-process path in phase_loop should be surfaced as
a recoverable outcome instead of falling through as a generic incomplete run.
Update the liveness_check() failure handling in phase_loop.py so a false result
sets a distinct finish reason such as server_unreachable and/or session_stalled
using the existing _last_finish_reason flow, then ensure the runner maps that
finish reason into the existing RunStatus.INCOMPLETE / SERVER_UNREACHABLE
recovery branch before harness branching. Use the phase_loop loop logic and the
runner’s finish-reason/status conversion code to keep recovery behavior
consistent.
- Around line 68-75: The stall timeout environment parsing in PhaseEventLoop
should be made resilient so invalid values do not raise during construction.
Update the initialization logic around the _stall_timeout_s and
_heartbeat_stall_timeout_s assignments in PhaseEventLoop to parse with a safe
fallback, matching the behavior used for CODECOME_RESUME_PROBE_TIMEOUT. If
either CODECOME_BUSY_STALL_TIMEOUT or CODECOME_HEARTBEAT_STALL_TIMEOUT is
missing or malformed, keep the default timeout value instead of letting float()
fail.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: e79ed5d1-cadf-4f73-989e-1fc313b505ca
📒 Files selected for processing (19)
.project/resume-server-death-resilience-plan.mdREADME.mdtemplates/sandboxes/erlang-otp/docker-compose.ymltests/test_codecome_runner.pytests/test_harness_recovery_restart.pytests/test_runner_resume_health.pytests/test_sandbox_bootstrap.pytests/test_session.pytests/test_sse_busy_resilience.pytools/codecome/harness.pytools/codecome/phase_1.pytools/codecome/runner.pytools/codecome/session.pytools/codecome/status.pytools/events/base.pytools/events/phase_loop.pytools/events/sse_client.pytools/opencode/serve.pytools/phases/completion.py
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/test_codecome_runner.py (1)
197-243: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winAdd a local timeout guard to this stalled-consumer test. There’s no suite-wide timeout configured, so this case can hang if the consumer-stop path regresses.
🕐 Suggested defensive timeout guard
+@pytest.mark.timeout(10) def test_run_single_attempt_prompt_timeout_requires_consumer_exit(mock_args, mock_console, monkeypatch):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_codecome_runner.py` around lines 197 - 243, Add a local timeout guard around the stalled-consumer case in test_run_single_attempt_prompt_timeout_requires_consumer_exit so the test cannot hang if _run_single_attempt or the consumer-stop path regresses. Keep the existing assertions, but wrap the call/behavior under test with a bounded timeout using the test framework’s timeout mechanism or a small helper, and make sure the guard is scoped to this test only.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/test_codecome_runner.py`:
- Around line 197-243: Add a local timeout guard around the stalled-consumer
case in test_run_single_attempt_prompt_timeout_requires_consumer_exit so the
test cannot hang if _run_single_attempt or the consumer-stop path regresses.
Keep the existing assertions, but wrap the call/behavior under test with a
bounded timeout using the test framework’s timeout mechanism or a small helper,
and make sure the guard is scoped to this test only.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: e4e248d5-be3f-4bfb-b20c-f174f01b96f3
📒 Files selected for processing (10)
tests/test_codecome_runner.pytests/test_harness_recovery_restart.pytests/test_runner_resume_health.pytests/test_session.pytests/test_sse_busy_resilience.pytools/codecome/harness.pytools/codecome/runner.pytools/codecome/session.pytools/events/phase_loop.pytools/events/sse_client.py
🚧 Files skipped from review as they are similar to previous changes (6)
- tests/test_harness_recovery_restart.py
- tests/test_runner_resume_health.py
- tools/events/phase_loop.py
- tools/events/sse_client.py
- tools/codecome/runner.py
- tools/codecome/harness.py
Summary
Validation
Summary by CodeRabbit