Skip to content

fix(pi): kill the whole Pi process tree on session close (F08, F07)#712

Open
SabhyaC26 wants to merge 3 commits into
mainfrom
fix/pi-orphan-process-tree-f08-f07
Open

fix(pi): kill the whole Pi process tree on session close (F08, F07)#712
SabhyaC26 wants to merge 3 commits into
mainfrom
fix/pi-orphan-process-tree-f08-f07

Conversation

@SabhyaC26

Copy link
Copy Markdown
Contributor

Summary

Fixes F08 (high) and F07 (low) from SDK_INTEGRATION_BUG_AUDIT.md.

  • F08 — orphaned Pi/node tree on close. _PiRpcSession.start() spawned Pi via _create_subprocess_exec with no start_new_session/process-group isolation, and close() only terminate()/kill()'d the single direct child PID. The spawned child is frequently the sandbox launcher wrapping the pi node CLI, which itself spawns MCP servers / shell tools — so on every close/interrupt the real pi tree was orphaned (reparented to init), leaking CPU/memory/FDs per abandoned turn. Now mirrors codex_executor: spawn with start_new_session=(os.name == "posix") and tear down with os.killpg(SIGTERM → SIGKILL) against the group, falling back to direct terminate()/kill() off-posix.
  • F07 — zombie after kill. After the SIGKILL fallback, close() never reaped the child, so it could linger as a zombie under non-default child watchers. Added a guarded await self.process.wait() after the kill (mirrors codex ~1225-1229). The await is wrapped in contextlib.suppress(Exception) so a wrong-event-loop RuntimeError (test fixtures) can't abort the rest of close().

Added import signal for the new os.killpg calls.

Test plan

  • pytest tests/inner/test_pi_executor.py -q → 113 passed
  • ruff check omnigent/inner/pi_executor.py → clean
  • No new mypy errors introduced (existing explicit-any errors are pre-existing and unrelated to these lines)

@github-actions github-actions Bot added the size/S Pull request size: S label Jun 18, 2026
@omnigent-ci

omnigent-ci Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Polly AI Review

PR Review: fix(pi): kill the whole Pi process tree on session close (F08, F07)

Reviewer: codex (cross-vendor) + polly synthesis


1. Blocking Issues

None identified. The implementation correctly satisfies both contracts:

  • start_new_session=(os.name == "posix") at spawn time ensures pgid == child pid.
  • os.killpg(pid, SIGTERM) in the graceful path and os.killpg(pid, SIGKILL) in the timeout fallback cleanly signal the whole subtree.
  • The pid is not None guard is present before both killpg calls.
  • await self.process.wait() reap is correctly placed only in the SIGKILL branch (the happy path naturally reaps via the wait_for call).
  • close_subprocess_transport / self.process = None execute unconditionally after both paths.

2. Security Analysis

No security regressions. start_new_session=True isolates the Pi process group from the parent's process group, which is strictly safer — signals sent to the server process group can no longer accidentally reach Pi children. os.killpg is scoped to the spawned child's group. The broad contextlib.suppress(PermissionError, OSError) on the POSIX kill paths is appropriate for cleanup code and does not introduce an exploitable surface.

One theoretical note: os.killpg(pid, SIGKILL) after a delay is subject to PID reuse (TOCTOU) — if the original group exited and the OS recycled the PID, a different process group could receive SIGKILL. This is a known, accepted limitation of this pattern (codex_executor has the same exposure), and the SIGTERM → wait_for(2s) → SIGKILL sequence makes the window narrow. Not actionable here.


3. Non-Blocking Suggestions

  • Redundant / overly-broad suppress on non-POSIX terminate(): contextlib.suppress(ProcessLookupError, Exception)Exception subsumes ProcessLookupError, making the listing redundant. More importantly, this is wider than the original suppress(ProcessLookupError), potentially silencing unexpected errors on Windows. Consider reverting to the narrower form for the non-POSIX path: contextlib.suppress(ProcessLookupError).

  • Silent killpg failures: PermissionError or OSError from os.killpg is silently swallowed. A logger.debug or logger.warning on the error path would make operational debugging easier without changing behavior.

  • Test coverage gap: No test asserts that a POSIX spawn sets start_new_session=True or that close() issues os.killpg rather than process.terminate(). A mock-based unit test for the group-kill path would prevent silent regression if the constant expression is ever edited.

  • pid == 0 guard: os.killpg(0, sig) would signal the orchestrator's own process group — catastrophic. subprocess.Process.pid should never be 0, but an explicit pid > 0 guard instead of pid is not None would be belt-and-suspenders.


4. Summary

This is a clean, well-scoped fix that correctly mirrors the codex_executor pattern. The core logic — new session at spawn, killpg SIGTERM/SIGKILL at teardown, guarded wait() reap — is sound and satisfies both F08 and F07. There are no blocking correctness or security issues. The main polish items are narrowing the non-POSIX suppress back to ProcessLookupError, adding a log on suppressed killpg errors, and covering the group-kill path with at least one unit test.


Automated review by Polly · workflow run

@github-actions github-actions Bot added size/M Pull request size: M and removed size/S Pull request size: S labels Jun 18, 2026
Comment thread omnigent/inner/pi_executor.py Fixed
Comment thread omnigent/inner/pi_executor.py Fixed
_PiRpcSession spawned Pi without process-group isolation and teardown
only signalled the direct child, so the sandbox launcher + pi node CLI
and any descendants it spawned (MCP servers, shell tools) were orphaned
on every close/interrupt. Mirror codex_executor: spawn with
start_new_session on posix and os.killpg(SIGTERM->SIGKILL) the group,
falling back to direct terminate()/kill() off-posix.

Also reap the killed child with await wait() so it can't linger as a
zombie under non-default child watchers (F07).
Strengthen the F08/F07 fix by falling back to direct process signals when process-group signalling fails, and add a real subprocess-tree regression test proving close() reaches descendants instead of only the launcher.
Polly/github-code-quality flagged the two empty 'except' blocks around
os.killpg(SIGTERM/SIGKILL) in _PiRpcSession.close() as silent. They are
best-effort: on failure the group_*_sent flag stays False and the
fallback terminates/kills the direct child. Document that intent.

Co-authored-by: Isaac
@SabhyaC26 SabhyaC26 force-pushed the fix/pi-orphan-process-tree-f08-f07 branch from 09cda6b to 6c411b1 Compare June 18, 2026 22:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Pull request size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant