feat(agents): migrate the remediation_executor to Pydantic AI (PR #2 of 3) by galanko · Pull Request #251 · cliff-security/cliff

galanko · 2026-06-01T13:05:03Z

User description

PR #2 of three executing IMPL-0022 against ADR-0047. Moves the remediation_executor — the one tool-using agent — off the OpenCode subprocess and onto in-process Pydantic AI, with a durable human-in-the-loop approval cycle. PR #1 (#221 + #249) migrated the six no-tools agents; PR #3 deletes the leftover OpenCode engine/templates + the app-level normalizer.

Net: +2,718 / −2,127 — a ~2,400-line deletion of subprocess / SSE / permission-queue machinery, replaced by ~900 lines of in-process tools + a DB-marker park/resume cycle. Exactly ADR-0047's "simpler" thesis.

What changed (by slice)

Slice	Detail
A — tool primitives	`backend/cliff/agents/runtime/tools/` — `bash`, `edit`, `read`, `webfetch`, `gh` as in-process `@agent.tool` functions, plus `permissions.py` (the auto/ask/deny tier classifier ported verbatim from the OpenCode-era `_classify_tool_request` + denylists). `gate_tool_call` maps a tier to PA's HITL vocabulary: `deny`→`ModelRetry` (model pivots), `ask`→`ApprovalRequired` until approved, `auto`→run.
B — executor agent	`runtime/remediation_executor.py` — `Agent(output_type=[RemediationExecutorOutput, DeferredToolRequests], tools=…, toolsets=<mcp>)`. SYSTEM_PROMPT lifted from the Jinja template (all seven hard rules preserved verbatim — each = a real shipped regression). `runtime/tools/mcp.py` adapts Cliff's resolved MCP configs to PA toolsets.
C — dispatch + HITL	`executor.execute()` routes `remediation_executor` through `_run_pa_executor`. A gated tool → `DeferredToolRequests` output → migration 023 persists `permission_request` + `pa_message_history` on the agent_run row, returns the new `awaiting_permission` status. `POST .../permission` resumes via `resume_executor` (rebuild agent → deserialize history → `ToolApproved`/`ToolDenied` → re-enter or re-park). Shared `_finalize_run` / `_resolve_active_model` / `_run_pa_call` extracted so initial-run and resume reach identical persistence + retry.
D — delete frontend SSE	The `/agent-execution/stream` route + the `IssueSidePanel` EventSource were only ever a refetch nudge over the polled `agent-runs` query (which is the source of truth and renders the prompt from the durable marker). Both deleted; the frontend relies on the 3s/5s poll.
E — delete OpenCode machinery	Removed `_send_and_collect`, `_collect_response`, `approve_tool`/`deny_tool`/`push_permission_event`/`ensure_permission_queue`/`_patch_permission_marker`, `_PendingApproval`, the dead string-based rate-limit classifier, and the in-memory permission queue. `_active_runs`/`check_not_busy` stay (run-tracking).

What did NOT change

The six no-tools agents (PR Add branch protection rules and Git workflow guidelines #1) and their persistence path.
The tier-classification policy — ported verbatim; only the tier→exception translation is new (ModelRetry/ApprovalRequired).
repo_workspace_runner.py's own _send_and_collect (the repo-action generators) and _STRUCTURED_OUTPUT_CONTRACTS / build_agent_prompt — OpenCode prompt machinery left for PR Switch license from MIT to AGPL-3.0 for open-core model #3.

Deviations from IMPL-0022 (verified against PA runtime)

deny → ModelRetry, not ValueError. PA propagates a raw tool ValueError (crashes the run); ModelRetry is what PA converts to a model-visible retry — matching ADR-0047's stated "the model gets a deterministic error and pivots." (IMPL text wants a one-line update.)
Frontend SSE deleted, not retargeted. The IMPL's conservative "keep the wire, swap the producer" was over-built — the channel carried no data post-PA. Deleting it is the bigger, cleaner win.

Test plan

uv run ruff check cliff/ tests/ — clean
Full backend non-e2e suite — 1570 passed / 14 skipped (the only failure is the pre-existing wiz-s3-public LLM-eval flake, untouched by this branch)
New tests: test_deferred_tools_persist.py (real FunctionModel park→resume + corrupt-state→failed), test_remediation_executor.py (defer/deny/complete), test_executor_provider_chain.py (6 providers), tools/ (44), rate-limit retargeted to the PA _run_pa_call path
OpenAPI snapshot regenerated (drops the stream path, adds optional deny_message)
Live smoke on the installed build — migration 023 applied, PA executor path live, OpenCode/SSE machinery gone; cliffsec selftest ran the full no-tools pipeline in-process to the plan gate, no regression. Transcript: cliff-os/docs/qa/migration-smoke/PR2-executor/
/code-review (recall, 7 angles) — one confirmed reliability gap fixed (resume_executor no longer wedges a run on a corrupt pa_message_history); two findings refuted (DB lifetime, edit-approve semantics); cleanup deferred to PR Switch license from MIT to AGPL-3.0 for open-core model #3
Live HITL smoke docs: restructure roadmap into parallel stages #4 (rm -rf approval) + feat: Phase 3 — Persistence & Domain Model #5 (validator retry) — blocked on a GitHub re-auth (expired token → executor can't open a PR, independent of the migration). Park/resume is unit-covered; best finished interactively post-merge.

Skipped tests (un-skipped/removed in PR #3)

13 OpenCode-machinery tests carry explicit PR2.E skip reasons (same staged pattern PR #1 used). Their behaviour is replaced by test_deferred_tools_persist.py + the route tests; the dead machinery they exercise is deleted in PR #3.

Stacked PRs

PR Add branch protection rules and Git workflow guidelines #1 — six no-tools agents (feat(agents): migrate six no-tools agents to Pydantic AI substrate #221, merged) + backfill (chore(pa): backfill PR #221 — tests + review fixes that didn't push before merge #249, merged).
This PR (feat: Phase 2 — App Shell with Stitch design system #2) — executor on PA + DeferredToolRequests.
PR Switch license from MIT to AGPL-3.0 for open-core model #3 (chore/drop-opencode-substrate) — delete backend/cliff/engine/, templates + template engine, repo_workspace_runner OpenCode path, the dead constants this review flagged, regen frontend OpenAPI types, and replace the singleton OpenCode (ADR-0022 normalizer) with a small PA agent.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Agent runs now pause with an "awaiting_permission" state for gated tool calls and can be resumed via background/polled approval endpoints.
- New in-process remediation tools (bash, edit, read, webfetch, gh) with safety gating and optional MCP toolset support.
- Executor persists message history so paused runs can be resumed.
Bug Fixes
- Removed reliance on SSE streams for permission flow; improves reliability and cleanup.
Tests
- Added coverage for deferred-tool park/resume, tools, permissions, and rate-limit backoff.

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Migrate the remediation_executor path in AgentExecutor onto the remediation_executor Pydantic AI agent so tool-using runs execute in-process with deferred-tool outputs and HITL resume markers. Update the POST /permission resume endpoint, pipeline handling, migration 023, and frontend polling so permission decisions rely on the durable agent_run marker instead of the deleted SSE/subprocess plumbing.

Topic Details

Executor PA Flow

Orchestrate AgentExecutor to run the remediation_executor via _run_pa_executor/resume_executor, persist DeferredToolRequests markers and message history, and finalize the paused result through the POST /permission flow while the pipeline defers on awaiting_permission instead of failing.

Modified files (12)

backend/cliff/agents/executor.py
backend/cliff/agents/pipeline.py
backend/cliff/api/routes/agent_execution.py
backend/cliff/db/migrations/023_agent_run_pa_message_history.sql
backend/cliff/db/repo_agent_run.py
backend/tests/agents/test_deferred_tools_persist.py
backend/tests/api/openapi_snapshot.json
backend/tests/integration/test_permission_flow.py
backend/tests/integration/test_rate_limit_backoff.py
frontend/src/__tests__/IssueSidePanel.permission-prompt.test.tsx
frontend/src/api/client.ts
frontend/src/components/issues/IssueSidePanel.tsx

Latest Contributors(1)

User	Commit	Date
galank@gmail.com	fix(executor): harden ...	June 01, 2026

Other

Other files

Modified files (6)

backend/cliff/models/__init__.py
backend/pyproject.toml
backend/tests/agents/tools/__init__.py
backend/tests/test_executor.py
backend/tests/test_routes_agent_execution.py
frontend/src/__tests__/IssueSidePanel.sse-progress.test.tsx

Latest Contributors(1)

User	Commit	Date
galank@gmail.com	feat(agents): persiste...	June 01, 2026

Tool Primitives

Implement the in-process tool primitives, permission-tier gate, MCP adapter, and focused tests so the remediation_executor agent can invoke bash/edit/read/webfetch/gh safely and retain the existing auto/ask/deny policy under Pydantic AI.

Modified files (13)

backend/cliff/agents/runtime/remediation_executor.py
backend/cliff/agents/runtime/tools/__init__.py
backend/cliff/agents/runtime/tools/bash.py
backend/cliff/agents/runtime/tools/edit.py
backend/cliff/agents/runtime/tools/gh.py
backend/cliff/agents/runtime/tools/mcp.py
backend/cliff/agents/runtime/tools/permissions.py
backend/cliff/agents/runtime/tools/read.py
backend/cliff/agents/runtime/tools/webfetch.py
backend/tests/agents/test_remediation_executor.py
backend/tests/agents/tools/test_mcp.py
backend/tests/agents/tools/test_permissions.py
backend/tests/agents/tools/test_tools.py

Latest Contributors(1)

User	Commit	Date
galank@gmail.com	feat(agents): executor...	June 01, 2026

_{Review this PR on Baz | Customize your next review}

First slice of PR #2 (ADR-0047 / IMPL-0022) — the five in-process tool functions that replace OpenCode's tool dispatch for the remediation_executor, plus the permission tiering they enforce. New package backend/cliff/agents/runtime/tools/: - permissions.py — classify_tool_request + _CATASTROPHIC_BASH + _GATED_BASH + _is_pipe_to_shell ported VERBATIM from the OpenCode-era executor classifier (same patterns, same tiers). gate_tool_call is the new thin layer that maps a tier to PA's HITL vocabulary: deny → raise ValueError (deterministic, model pivots), ask → raise ApprovalRequired until ctx.tool_call_approved, auto → proceed. - bash.py — async subprocess via asyncio.to_thread (non-blocking), cwd=workspace_dir, 180s per-command timeout, output trimmed to last 200 lines. - edit.py — classifier gate + a second resolved-path containment check against workspace_dir (catches symlink / normalized-separator escapes the textual check misses). - read.py — auto tier, 50 KB cap with truncation marker. - webfetch.py — auto tier, 30s timeout, text/* + application/json allowlist, 50 KB cap. SSRF posture documented (single-tenant CE). - gh.py — thin wrapper that prepends "gh " and delegates to bash so the token stays out of the model-visible command string; GH_TOKEN flows via deps.env_vars. The OpenCode classifier copy in executor.py stays until PR2.E deletes it; both are identical and both are tested. Tests (44): - test_permissions.py — the full classifier table ported verbatim from the OpenCode-era TestClassifyToolRequest, plus gate behavior (deny→ValueError even when approved, ask→ApprovalRequired, ask-approved→proceeds). - test_tools.py — functional coverage of all five tools against a tmp_path workspace (run/trim/deny/ask/approve for bash; relative write + absolute/traversal/symlink-escape gating for edit; cap + missing-file for read; content-type allowlist for webfetch via httpx_mock; token-missing hint + bash delegation for gh). ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Second slice of PR #2 (ADR-0047 / IMPL-0022): the tool-using remediation_executor agent definition, plus the MCP-config adapter and the runtime fixes that registering tools on a real Agent surfaced. New: - runtime/remediation_executor.py — build_agent(model, mcp_toolsets=()) returns Agent[WorkspaceDeps, RemediationExecutorOutput | DeferredToolRequests], registering the five tools. SYSTEM_PROMPT is lifted from the OpenCode-era Jinja template — workspace-safety block and all seven hard rules preserved verbatim (each = a real shipped regression); the dynamic finding/plan context now arrives via the shared build_user_prompt (as the no-tools agents consume it); the JSON output-contract block is gone (output_type enforces the schema). The union output_type is what lets a gated tool's ApprovalRequired pause the run with a DeferredToolRequests output instead of erroring. - runtime/tools/mcp.py — build_mcp_toolsets converts Cliff's resolved OpenCode-shaped MCP configs ({type, command, environment} / {type, url}) into PA MCPServerStdio / MCPServerStreamableHTTP toolsets, tool_prefix'd by integration id, malformed entries skipped not fatal. Two runtime discoveries from wiring tools onto an Agent: 1. PA introspects each tool's type hints via get_type_hints at registration. With `from __future__ import annotations` the RunContext[WorkspaceDeps] annotation is a string that must resolve against module globals — so RunContext + WorkspaceDeps move from TYPE_CHECKING to runtime imports in all five tool modules. Scoped ruff per-file-ignore (TC001/TC002) for runtime/tools/*, matching the existing cliff/models/*.py precedent. 2. deny tier raises ModelRetry, not ValueError. IMPL-0022 named ValueError, but PA propagates a raw tool ValueError (crashing the run) whereas ModelRetry is converted to a RetryPromptPart the model sees and pivots on — which is exactly ADR-0047's stated intent ("the model gets a deterministic error and pivots"). Verified in PA's tool_manager.py. A model that keeps retrying a denied command hits PA's per-tool retry cap; no infinite loop. The verbatim classifier (auto/ask/deny classification) is unchanged — only the tier→exception translation differs. Flagged for an IMPL-0022 text update. MCPServerStdio/StreamableHTTP are deprecated in PA 1.98 in favour of MCPToolset, but MCPToolset's arbitrary-command form needs fastmcp (not a dependency on the pinned line) — so the deprecated-but-working classes are the only ones that express our `npx …` configs. Deprecation suppressed at the construction site with a documented reason; revisit at the PA v2 bump (tracked with the PR #3 config rename). Tests (55 new): test_remediation_executor.py drives the agent with TestModel (normal completion → RemediationExecutorOutput) and FunctionModel (gated `rm -rf` → DeferredToolRequests with the pending approval + command metadata; catastrophic `sudo` → ModelRetry → model pivots, no defer). test_mcp.py covers the adapter (local→stdio, remote→http, disabled/malformed/unknown skipped). Full agents suite: 98 passed. ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…C/1) Durable storage for the PA DeferredToolRequests resume cycle, ahead of the executor dispatch + resume wiring: - migration 023 — adds nullable agent_run.pa_message_history TEXT. When the remediation_executor calls a gated tool the run pauses with a DeferredToolRequests output; resuming needs the full conversation fed back via agent.run(message_history=...). Storing PA's all_messages_json() in the DB (not in-memory / not a workspace file) keeps it atomic with the existing permission_request marker and survives a daemon restart mid-approval — same durability the marker got in migration 022. - AgentRunUpdate.pa_message_history field so update_agent_run writes it; deliberately NOT on the API-facing AgentRun model (large internal blob). - repo_agent_run.get_pa_message_history(run_id) — targeted reader for the resume path. - reconcile_orphaned_agent_runs also NULLs pa_message_history so a restart mid-pause can't leave stale resume state (mirrors how it already clears permission_pending/permission_request). Verified: migration applies + column present; 84 model/db/migration tests pass; ruff clean. Dispatch + _run_pa_executor + resume API land in the next commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

….C/2) Routes the remediation_executor off OpenCode onto the in-process Pydantic AI agent + the durable DeferredToolRequests park/resume cycle. Executor (executor.py): - _finalize_run: extracted from execute()'s post-parse+persist tail (ref-verify, evidence-guard, PR-verify, context/sidebar/advance, run-update) so the initial run AND the resume path reach identical finalization — one source of truth. Pure refactor, verified no-op against the existing suite before the new code landed. - _resolve_active_model + _run_pa_call: shared helpers (resolver gather → build_model; and the rate-limit/timeout/UnexpectedModelBehavior translation loop) used by both the no-tools path and the executor — removes ~145 lines of duplication. - _run_pa_executor: builds the executor agent (build_agent + resolved MCP toolsets), runs it via _run_pa_call. A DeferredToolRequests output → _PaExecutorOutcome carrying the permission marker + all_messages_json(); a completed output → a ParseResult shaped like the no-tools path. - _park_for_permission: persists the marker (permission_pending=1 + permission_request) + pa_message_history, returns the new ``awaiting_permission`` AgentExecutionResult status (run row stays ``running`` so issue_derivation routes to "Needs you"). - resume_executor: reloads deps, deserializes the message history, re-enters agent.run with DeferredToolResults (ToolApproved/ToolDenied per pending tool_call_id), then _finalize_run (or re-park). - execute() dispatch: remediation_executor → _run_pa_executor; the old OpenCode tool-agent block is gone (its orphaned plumbing — _send_and_collect, the permission queue — is deleted in PR2.E). pipeline.py: run_pipeline also stops on ``awaiting_permission`` (a non-terminal pause, not a failure). route (agent_execution.py): POST .../permission now resumes via executor.resume_executor in a background task and returns immediately; the frontend polls agent-runs for the outcome (no SSE — PR2.D deletes it). tools: bash/edit gate metadata now carries ``patterns`` so the persisted marker matches what the frontend approval prompt renders (tool + patterns). Tests: - test_deferred_tools_persist.py (new): real FunctionModel park→resume cycle — gated bash pauses with a marker + history, resume(ToolApproved) completes, marker cleared; resume-without-marker raises. - test_executor.py: keeper tests retargeted to the PA path via a shared _executor_with_pa helper; OpenCode-machinery tests (pool-start, SSE error/progress, parse-retry, asyncio.Event approve) skipped with PR2.E reasons (same staged pattern PR #1 used). - test_rate_limit_backoff.py: rewritten to exercise the shared _run_pa_call retry via a flaky run_no_tools_agent raising ModelHTTPError(429). The OpenCode string-classification corpus tests are retired — PA detects rate-limit structurally (status_code==429), not by matching error-message text. Jitter unit test preserved. - test_permission_flow.py: 2 OpenCode approve/deny flow tests skipped (covered by test_deferred_tools_persist.py); the derivation-routing test stays. - test_routes_agent_execution.py: TestPermissionEndpoint retargeted to the background resume_executor flow. Full slice-C suite: 311 passed, 10 skipped, 0 failed; ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ery (PR2.D+E) With the executor on Pydantic AI + a durable DeferredToolRequests marker (PR2.C), the whole push-based progress/approval channel is dead weight — the frontend already polls agent-runs (3s active / 5s idle) and renders the permission prompt + activity feed from that. ADR-0047's "much simpler mechanism": executor writes the marker → frontend polls → renders. PR2.D — delete the frontend SSE: - backend: removed GET /workspaces/{id}/agent-execution/stream + its event_generator; dropped the on_permission=push_permission_event wiring from execute_agent + run_all. - frontend: removed the EventSource useEffect from IssueSidePanel (and the now-unused useQueryClient/api imports) and api.streamAgentExecution. The permission prompt + feed render entirely from the polled agent-runs query — the SSE was only ever a refetch nudge, carried no data, and streamed no tokens. - tests: deleted IssueSidePanel.sse-progress.test.tsx + the EventSource stub/assertion in the permission-prompt test; removed the backend SSE stream route tests. PR2.E — delete the orphaned OpenCode permission/streaming machinery in executor.py: - _PendingApproval, approve_tool, deny_tool, push_permission_event, get_permission_queue, ensure_permission_queue, _patch_permission_marker - _send_and_collect + _collect_response (the OpenCode SSE consumer) - _RETRY_PROMPT + the dead _RATE_LIMIT_SUBSTRINGS / _looks_like_rate_limit string-classifier (PA detects rate-limit structurally via ModelHTTPError.status_code == 429) - _cleanup_workspace_state trimmed to just clearing the active-run marker; the _pending_approvals / _permission_queues / _permission_pending dicts are gone. (_active_runs / get_active_run_id / check_not_busy stay — those are run-tracking, not SSE.) - tests: deleted TestPermissionApproval; skipped the 3 SSE-progress-event tests in TestAgentExecutor (PR2.D removed the events). Note: cliff/workspace/repo_workspace_runner.py keeps its OWN _send_and_collect (the repo-action generators are a separate OpenCode path that PR #3 migrates). _STRUCTURED_OUTPUT_CONTRACTS + build_agent_prompt also stay for PR #3's plumbing/template deletion. openapi snapshot regenerated: drops the stream path, adds the optional deny_message field on PermissionDecision (the resume API). Verified: ruff clean; backend non-e2e suite 1570 passed / 14 skipped (only failure is the pre-existing wiz-s3-public LLM-eval flake, untouched by this branch); the SSE-touched frontend tests (permission-prompt + IssueSidePanel core) pass. The 10 failing frontend chip-styling tests are pre-existing design-token drift in components this branch never touches. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tate /code-review (recall) confirmed: the resume setup steps inside the try — ModelMessagesTypeAdapter.validate_json on a corrupt/truncated pa_message_history, or a malformed tool_call_ids — raise ValidationError/TypeError, which the specific except (AgentTimeoutError|AgentRateLimitError|AgentProcessError) handler didn't cover. The exception escaped to the route's catch-all logger and the run stayed running + permission_pending forever (workspace busy until daemon restart). Two fixes: - tool_call_ids = (... or {}).get("tool_call_ids") or [] — an explicit null in the marker JSON can no longer slip a None into the approvals comprehension. - Added a catch-all except in resume_executor that marks the run failed (cleanup + status=failed + last_error), so NO resume error can leave the run wedged — the finding routes to the Retry CTA instead of hanging. Test: test_resume_with_corrupt_history_marks_failed_not_wedged — a non-JSON pa_message_history yields status=failed + marker cleared, not a raised exception or a stuck run. Review notes (no change needed): - background-task DB use is safe — get_db yields a process-global singleton with no per-request close (same pattern execute_agent uses). - edit approved-escape is by-design — PA sets ctx.tool_call_approved per-tool-call-id, so an approved escaping edit is the user consenting to that exact path (HITL). - Dead OpenCode constants (PERMISSION_WAIT_BUFFER, TOOL_TIERS, _STRUCTURED_OUTPUT_CONTRACTS, build_agent_prompt) + stale frontend OpenAPI types (deleted stream endpoint, missing deny_message) are PR #3 / type-regen cleanup, deliberately deferred. 80 passed / 11 skipped on the executor+resume+route slice; ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-01T13:05:17Z

📝 Walkthrough

Walkthrough

This PR replaces the OpenCode subprocess/SSE remediation executor with an in-process Pydantic AI remediation agent, adds gated tool approval with durable pause/resume (pa_message_history), centralizes PA call/retry/error handling, removes SSE streaming in favor of polling, and updates DB, API, frontend, and tests.

Changes

PA executor migration and permission deferral

Layer / File(s)	Summary
Pipeline stop condition `backend/cliff/agents/pipeline.py`	`run_pipeline` treats `awaiting_permission` as a stopping/pausing status.
Remediation executor runtime `backend/cliff/agents/runtime/remediation_executor.py`	Adds `SYSTEM_PROMPT` and `build_agent(...)` producing union output supporting DeferredToolRequests for gated tool deferral.
Tools package init `backend/cliff/agents/runtime/tools/__init__.py`	Registers ordered `EXECUTOR_TOOLS` tuple exported for the remediation agent.
Bash tool `backend/cliff/agents/runtime/tools/bash.py`	Async `bash` tool with gating, background subprocess execution, timeout, output trimming, and export.
GH wrapper tool `backend/cliff/agents/runtime/tools/gh.py`	`gh` validates `GH_TOKEN` then delegates to `bash`.
DB migration and repo helpers `backend/cliff/db/migrations/023_agent_run_pa_message_history.sql`, `backend/cliff/db/repo_agent_run.py`	Add `pa_message_history` column and repo helper `get_pa_message_history`; reconciliation clears the field when failing orphaned runs.
Model updates / OpenAPI `backend/cliff/models/__init__.py`, `backend/tests/api/openapi_snapshot.json`	Add `pa_message_history` to `AgentRunUpdate` and `deny_message` to `PermissionDecision`; snapshot updated and streaming path removed.
Remediation executor tests `backend/tests/agents/test_remediation_executor.py`	Tests for system prompt safety, tool registration, normal completion, deferral via DeferredToolRequests, and deny-tier handling.
Permission classifier & gating tests `backend/tests/agents/tools/test_permissions.py`	Tests for `classify_tool_request` and `gate_tool_call` including ModelRetry and ApprovalRequired semantics.
API permission endpoint refactor `backend/cliff/api/routes/agent_execution.py`	`respond_to_permission` resumed paused executor asynchronously with `approved`/`deny_message`; removed on_permission callbacks at background launch sites.
AgentExecutor core changes `backend/cliff/agents/executor.py`	Add Pydantic AI message/tool primitives, remove OpenCode-era helpers, extend `AgentExecutionResult.status` with `awaiting_permission`, add `_run_pa_call`/_resolve_active_model with timeout/retry/exception mapping, initialize `_active_runs`.
PA executor, park/resume, finalization `backend/cliff/agents/executor.py`	Implement `_run_pa_executor`, `_park_for_permission`, `resume_executor` (rehydrate message history, apply approvals/denials), and centralized `_finalize_run`.
Edit tool `backend/cliff/agents/runtime/tools/edit.py`	`edit` tool with gating, resolved-path containment check, ApprovalRequired on escapes, background write, and relative-path display.
MCP toolset adapter `backend/cliff/agents/runtime/tools/mcp.py`	Adapter building MCP toolsets from config dicts, skipping malformed/disabled entries with warnings.
Permissions classifier implementation `backend/cliff/agents/runtime/tools/permissions.py`	Deny/ask/auto tiers, catastrophic denylist and gated patterns, `_is_pipe_to_shell`, `classify_tool_request`, and `gate_tool_call` raising ModelRetry or ApprovalRequired.
Read tool `backend/cliff/agents/runtime/tools/read.py`	Workspace-scoped read tool with 50KB cap, escape refusal, threaded read, decode-with-replacement, and truncation marker.
Webfetch tool `backend/cliff/agents/runtime/tools/webfetch.py`	HTTP GET wrapper with httpx streaming, content-type allowlist, 50KB cap, truncation marker, and formatted response.
Frontend SSE removal `frontend/src/api/client.ts`, `frontend/src/components/issues/IssueSidePanel.tsx`, `frontend/src/__tests__/IssueSidePanel.permission-prompt.test.tsx`	Removed agent-execution SSE client method and in-component EventSource; added `respondToPermission`/`respondToChatPermission` client endpoints and updated tests to rely on polled `agent-runs`.
Deferred tools park/resume tests `backend/tests/agents/test_deferred_tools_persist.py`	End-to-end tests asserting run parks with permission marker + persisted PA message history, and resume behavior for approve/deny, plus corrupt-history and marker-shaping tests.
MCP tests `backend/tests/agents/tools/test_mcp.py`	Tests for local/remote MCP toolset construction, skipping disabled/malformed/unknown-type entries.
Tool functional tests `backend/tests/agents/tools/test_tools.py`	Functional tests for bash/edit/read/webfetch/gh covering execution, safety gating, escapes, truncation, and delegation.
Executor unit test updates `backend/tests/test_executor.py`	Refactor tests to stub PA execution (_executor_with_pa/_pa_success_outcome/_pr_created_output), skip OpenCode/SSE-specific tests, and rewire PR-verification/timeouts to PA seam.
Rate-limit backoff tests `backend/tests/integration/test_rate_limit_backoff.py`	Rewrote to use ModelHTTPError-driven 429 detection and flaky-run helpers, plus zeroed-backoff helper and sustained-throttling scenario.
Permission route tests `backend/tests/test_routes_agent_execution.py`	Updated tests to assert `/permission` schedules `resume_executor` for pending runs and returns 404 for non-pending or cross-workspace runs.
Integration skips and OpenAPI `backend/tests/integration/test_permission_flow.py`, `backend/tests/api/openapi_snapshot.json`	Legacy in-process approval tests skipped; OpenAPI snapshot updated with new fields and streaming path removed.
Lint config `backend/pyproject.toml`	Add per-file Ruff ignore for TC001/TC002 in runtime tools path.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

cliff-security/cliff#249: Touches Pydantic-AI execution retry/error handling and run_pa* logic, which relates to this PR’s centralized PA call handling.
cliff-security/cliff#213: Frontend permission prompt UI changes that complement this PR’s paused/permission flow semantics.

Poem

"🐰 A rabbit in the toolchain hops,
Pauses runs to check for props,
Saves the chat to pick up later,
Approves or denies with human gator,
SSE fades — polling keeps the crops."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 44.89% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely describes the main change: migrating the remediation_executor from OpenCode subprocess to Pydantic AI, indicating this is PR `#2` of a larger series.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/pa-migrate-executor

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

frontend/src/__tests__/IssueSidePanel.permission-prompt.test.tsx (1)

63-86: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Exercise the paused-run contract in the fixture.

These tests still model the gated executor row as status: 'running', but the new backend flow parks it as awaiting_permission. Leaving the default fixture on the old status means this suite won’t catch the footer bug above once the real API starts returning parked runs.

Suggested fix

 function makeRun(overrides: Partial<AgentRun> = {}): AgentRun {
   return {
     id: 'run-1',
     workspace_id: 'ws-1',
     agent_type: 'remediation_executor',
-    status: 'running',
+    status: 'awaiting_permission',
     input_json: null,
     summary_markdown: null,
     confidence: null,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/__tests__/IssueSidePanel.permission-prompt.test.tsx` around
lines 63 - 86, The fixture makeRun currently returns a run with status 'running'
which prevents tests from exercising the paused-run contract; change the default
status in makeRun to 'awaiting_permission' so the
permission_pending/permission_request path is exercised by default (keep
permission_pending: true and permission_request as-is and allow callers to
override via the overrides parameter so existing tests can still set other
statuses when needed).

frontend/src/components/issues/IssueSidePanel.tsx (1)

1162-1203: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Match the parked run when rendering the permission footer.

This branch only looks up status === 'running', but the new park/resume flow persists the paused executor as awaiting_permission. In that case runningRun stays null, the footer remains stuck on “Waiting for approval details”, and the user can never approve or deny the deferred tool call.

Suggested fix

-  const runningRun =
-    agentRuns?.find((r) => r.status === 'running') ?? null
+  const actionableRun =
+    agentRuns?.find(
+      (r) => r.status === 'running' || r.status === 'awaiting_permission',
+    ) ?? null

   if (stage === 'awaiting_permission') {
-    const req = runningRun?.permission_request ?? null
-    if (!req || !runningRun) {
+    const req = actionableRun?.permission_request ?? null
+    if (!req || !actionableRun) {
       return (
         <div className="flex items-center gap-3 w-full">
           <CliffSpinner size={14} label="Loading approval details" />
@@
         errorMessage={
           respondToPermission.isError
             ? friendlyPermissionError(respondToPermission.error)
             : null
         }
         onApprove={() =>
-          respondToPermission.mutate({ runId: runningRun.id, approved: true })
+          respondToPermission.mutate({ runId: actionableRun.id, approved: true })
         }
         onDeny={() =>
-          respondToPermission.mutate({ runId: runningRun.id, approved: false })
+          respondToPermission.mutate({ runId: actionableRun.id, approved: false })
         }
       />
     )
   }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/components/issues/IssueSidePanel.tsx` around lines 1162 - 1203,
The permission-footer branch only looks up agentRuns.find(...) for status ===
'running', so when a run is parked as 'awaiting_permission' the UI never binds
the parked run and stays in the loading state; change the lookup used by
IssueSidePanel (the runningRun constant) to include both statuses (e.g.,
agentRuns?.find(r => r.status === 'running' || r.status ===
'awaiting_permission') ?? null) so the PermissionPrompt receives the parked run
(its permission_request and id) and the respondToPermission.mutate calls will
target the correct run.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/cliff/agents/executor.py`:
- Around line 833-834: The code calls _finalize_run assuming
outcome.parse_result exists even though _PaExecutorOutcome.parse_result can be
None; add an explicit guard/invariant before calling _finalize_run: check that
outcome.parse_result is not None (raise an AssertionError or ValueError with a
clear message if it is) and then pass the non-None parse_result to
_finalize_run; apply the same explicit check in the other occurrence referenced
(around the 1617–1622 block) to keep typing strict and avoid hidden None runtime
paths, and update any local type narrowing so callers treat parse_result as
non-Optional after the check.

In `@backend/cliff/agents/runtime/tools/bash.py`:
- Around line 62-81: The current _run() uses subprocess.run(...) and
asyncio.to_thread which buffers all output and may leave child processes alive
on timeout; replace this with an asyncio-based spawn using
asyncio.create_subprocess_shell/create_subprocess_exec (or subprocess.Popen with
start_new_session=True) so the process is started in a new session/process
group, stream stdout/stderr into bounded deques (or byte/line ring buffers) to
enforce output limits instead of capture_output=True, and enforce the timeout
with asyncio.wait_for; on timeout kill the whole process group
(os.killpg(process.pid, signal.SIGTERM) then SIGKILL if needed) and return the
trimmed buffered output. Update references in this file: replace uses of _run,
subprocess.run, and asyncio.to_thread with the new async streaming
implementation, keep using ctx.deps.workspace_dir, ctx.deps.env_vars, and
_BASH_TIMEOUT_SECONDS, and ensure _trim_output is replaced/rewired to use the
bounded buffers you populate incrementally.

In `@backend/cliff/agents/runtime/tools/edit.py`:
- Around line 62-70: The post-write call to target.relative_to(...) can raise
ValueError if the approved target is outside the workspace, turning a successful
write into a failure; wrap the relative_to() call in a try/except (catch
ValueError) and on exception produce a safe fallback path string (for example
use str(target) or a posix-style "../..." path) so the function still returns
success; modify the block around rel =
target.relative_to(Path(ctx.deps.workspace_dir).resolve()) in edit.py
(referencing target and the _write function) to attempt relative_to and fall
back to a non-crashing representation when ValueError is raised.

In `@backend/cliff/agents/runtime/tools/mcp.py`:
- Around line 85-103: Validate the MCP config types before constructing
MCPServerStdio or MCPServerStreamableHTTP: ensure `command` is a non-empty
list/tuple of strings (not a single string) before using command[0] and args,
and ensure `url` is a string before constructing the remote server; if the
values are malformed, raise a ValueError with a clear message (including
`entry_id`) so callers can skip the malformed toolset instead of invoking an
invalid tool.
- Around line 42-49: The helper context manager _suppress_mcp_v2_deprecation
lacks an explicit return type; annotate its signature as -> Iterator[None] and
add the necessary import (from typing import Iterator) so the function becomes
typed as an iterator-based contextmanager for backend strict typing (update the
_suppress_mcp_v2_deprecation definition and imports accordingly).

In `@backend/cliff/agents/runtime/tools/read.py`:
- Around line 28-45: The read tool currently resolves target =
(Path(ctx.deps.workspace_dir) / path).resolve() and opens it without approval or
containment checks; update the code to (1) call the centralized approval/gating
function (e.g., ctx.deps.gate_tool_call or equivalent) before reading, and (2)
enforce that the resolved target is inside the workspace by comparing
Path(ctx.deps.workspace_dir).resolve() and using
target.relative_to(workspace_dir) (or an explicit check that
target.resolve().is_relative_to(workspace_dir.resolve()) / raises if not) and
return a safe error if the check fails; keep the existing _read function and
asyncio.to_thread usage and preserve _MAX_READ_BYTES handling but never open
files outside the workspace or via symlink traversal without explicit approval.

In `@backend/cliff/agents/runtime/tools/webfetch.py`:
- Around line 31-52: The current implementation uses await client.get(url) and
resp.text which buffers the entire response before checking content-type and
enforcing _MAX_BODY_BYTES; change to a streaming fetch using
httpx.AsyncClient.stream (e.g., async with client.stream("GET", url) as resp) so
you can check resp.headers.get("content-type") against _ALLOWED_CONTENT_TYPES
before reading the body, then read chunks (resp.aiter_bytes or resp.aiter_raw)
accumulating at most _MAX_BODY_BYTES + 1 bytes and stop once that limit is
reached to detect truncation, decode the collected bytes with errors="replace",
append the truncation notice if the extra byte was read, and return the same
"HTTP {resp.status_code}\n{body}" string; preserve the existing HTTPError
handling around the request and continue using _WEBFETCH_TIMEOUT_SECONDS for the
client timeout.

In `@backend/cliff/api/routes/agent_execution.py`:
- Around line 532-540: The handler respond_to_permission currently checks that
the run exists and is permission_pending but does not verify ownership; after
calling get_agent_run and get_workspace, add an explicit check that
run.workspace_id == workspace_id (reference: get_agent_run, run.workspace_id,
get_workspace, respond_to_permission) and if they do not match raise an
HTTPException (404 or 403 per project convention) with an appropriate message
like "Agent run not found in this workspace" before proceeding with the
permission decision.
- Line 518: The new endpoint's db parameter is untyped; annotate it as a
SQLAlchemy Session to keep strict typing consistent by changing the parameter
from db=Depends(get_db) to db: Session = Depends(get_db) in the endpoint
signature (reference the db parameter and the get_db dependency), and add the
corresponding import (from sqlalchemy.orm import Session) at the top of the
module.

In `@backend/cliff/models/__init__.py`:
- Around line 173-176: The AgentRunStatus union/enum is missing the
"awaiting_permission" state which causes Pydantic validation to fail for
AgentRunUpdate paths that set status to "awaiting_permission"; update the
AgentRunStatus definition in backend/cliff/models/__init__.py (the
AgentRunStatus type/Enum used by AgentRun and AgentRunUpdate) to include the
literal/value "awaiting_permission", and run through any related validators or
type annotations (e.g., usages of AgentRunStatus, AgentRunUpdate, or status
fields) to ensure they accept this new state so pause/resume persistence and
park/resume flows validate correctly.

In `@backend/tests/agents/test_deferred_tools_persist.py`:
- Around line 39-45: The new fixture workspace_dir and other newly added helper
functions in this test module lack type hints; add strict annotations for
parameters and return types (e.g., annotate the pytest tmp_path parameter as
pathlib.Path and the fixture return as str for workspace_dir), and similarly
annotate all functions referenced in the review (the defs around lines 51-57 and
77-113) with concrete types (use pathlib.Path, dict[str, Any], list[T], bool, or
specific models as appropriate) so the module complies with the backend strict
typing guideline.

In `@backend/tests/agents/test_remediation_executor.py`:
- Around line 60-66: The local callback functions like _call_gated_bash need
explicit type annotations to satisfy strict typing: add parameter and return
type hints (e.g., def _call_gated_bash(messages: list[Message], info:
FunctionCallbackInfo) -> ModelResponse:) matching your FunctionModel callback
signature and any other callbacks in the same file (lines ~87-103), import or
reference the correct Message and FunctionCallbackInfo/ModelResponse types from
the codebase, and ensure any dicts/args use typing (dict[str, Any]) where
appropriate so the functions are fully typed.

In `@backend/tests/agents/tools/test_tools.py`:
- Around line 113-115: The test test_approved_write_outside_proceeds is passing
a workspace-relative path ("src/ok.py") so it never hits the "outside" edit
branch; change the test to pass an actual filesystem path outside the workspace
(e.g. construct an absolute path using tmp_path / "ok_outside.py" and ensure the
file exists) when calling edit(_ctx(tmp_path, approved=True), <outside_path>,
"y\n") so the outside/ask-tier path is exercised, then keep the existing assert
"Wrote" in result; adjust only the path argument to edit and any necessary test
file setup.

In `@backend/tests/integration/test_rate_limit_backoff.py`:
- Around line 91-93: The helper factory _make_flaky_run should declare full type
hints: annotate parameters (error_attempts: int, status_code: int = 429,
always_error: bool = False) and specify a precise return type (e.g.,
Callable[..., Coroutine[Any, Any, ReturnType]] or Callable[[...],
Awaitable[ResponseType]] depending on what the helper returns); update the inner
helper callables referenced around lines 102-103 and 113-116 similarly so every
helper parameter and return is explicitly typed (use typing.Callable,
typing.Awaitable/Coroutine, and concrete response/result types from the test
helpers or HTTP client types) to satisfy strict Python 3.11 backend typing
rules.

In `@backend/tests/test_executor.py`:
- Around line 111-116: Add strict type hints to the PA test helper functions:
annotate _executor_with_pa parameters and return type — e.g., declare mock_pool:
unittest.mock.Mock (or the specific Pool type), mock_context_builder:
Callable[..., Any], outcome_factory: Callable[[], _PaExecutorOutcome] and give
the function a concrete return type (the Executor/test helper type it produces);
similarly add matching type annotations to the other helper(s) referenced around
lines 135-150. Ensure you import typing symbols (Callable, Any) and the
_PaExecutorOutcome and Executor types so the signatures are fully typed.

In `@backend/tests/test_routes_agent_execution.py`:
- Around line 767-769: Replace the timing-dependent asyncio.sleep usage with an
explicit synchronization primitive: create an asyncio.Event in the test and
update the test helper that appends to the calls list (the same place that
currently appends ("run-1", True)) to call event.set() when it records the call;
then await asyncio.wait_for(event.wait(), timeout=...) instead of
asyncio.sleep(0.02). Apply the same change for the other occurrence that waits
for the background resume (lines referencing calls and the ("run-1", True)
assertion) so the test deterministically waits for the background resume to
complete.
- Around line 709-714: Annotate the new helper and the async test signatures
with explicit type hints: change def _permission_route_patches(*, pending: bool)
to include a return type (for example def _permission_route_patches(*, pending:
bool) -> Iterator[None] and import Iterator from typing, or use
ContextManager[Any] if it yields a context manager), and ensure every async test
function in this file has an explicit return annotation (e.g. async def
test_xyz(...) -> None). Also add missing parameter type annotations for any
other params referenced in those helpers (e.g. any mocked objects) so all new
helper and test signatures are fully typed per backend strict-typing rules.

---

Outside diff comments:
In `@frontend/src/__tests__/IssueSidePanel.permission-prompt.test.tsx`:
- Around line 63-86: The fixture makeRun currently returns a run with status
'running' which prevents tests from exercising the paused-run contract; change
the default status in makeRun to 'awaiting_permission' so the
permission_pending/permission_request path is exercised by default (keep
permission_pending: true and permission_request as-is and allow callers to
override via the overrides parameter so existing tests can still set other
statuses when needed).

In `@frontend/src/components/issues/IssueSidePanel.tsx`:
- Around line 1162-1203: The permission-footer branch only looks up
agentRuns.find(...) for status === 'running', so when a run is parked as
'awaiting_permission' the UI never binds the parked run and stays in the loading
state; change the lookup used by IssueSidePanel (the runningRun constant) to
include both statuses (e.g., agentRuns?.find(r => r.status === 'running' ||
r.status === 'awaiting_permission') ?? null) so the PermissionPrompt receives
the parked run (its permission_request and id) and the
respondToPermission.mutate calls will target the correct run.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 93b1665d-8d71-4054-b0aa-65255aef8408

📥 Commits

Reviewing files that changed from the base of the PR and between cef63fb and c25f096.

📒 Files selected for processing (31)

backend/cliff/agents/executor.py
backend/cliff/agents/pipeline.py
backend/cliff/agents/runtime/remediation_executor.py
backend/cliff/agents/runtime/tools/__init__.py
backend/cliff/agents/runtime/tools/bash.py
backend/cliff/agents/runtime/tools/edit.py
backend/cliff/agents/runtime/tools/gh.py
backend/cliff/agents/runtime/tools/mcp.py
backend/cliff/agents/runtime/tools/permissions.py
backend/cliff/agents/runtime/tools/read.py
backend/cliff/agents/runtime/tools/webfetch.py
backend/cliff/api/routes/agent_execution.py
backend/cliff/db/migrations/023_agent_run_pa_message_history.sql
backend/cliff/db/repo_agent_run.py
backend/cliff/models/__init__.py
backend/pyproject.toml
backend/tests/agents/test_deferred_tools_persist.py
backend/tests/agents/test_remediation_executor.py
backend/tests/agents/tools/__init__.py
backend/tests/agents/tools/test_mcp.py
backend/tests/agents/tools/test_permissions.py
backend/tests/agents/tools/test_tools.py
backend/tests/api/openapi_snapshot.json
backend/tests/integration/test_permission_flow.py
backend/tests/integration/test_rate_limit_backoff.py
backend/tests/test_executor.py
backend/tests/test_routes_agent_execution.py
frontend/src/__tests__/IssueSidePanel.permission-prompt.test.tsx
frontend/src/__tests__/IssueSidePanel.sse-progress.test.tsx
frontend/src/api/client.ts
frontend/src/components/issues/IssueSidePanel.tsx

💤 Files with no reviewable changes (2)

frontend/src/tests/IssueSidePanel.sse-progress.test.tsx
frontend/src/api/client.ts

coderabbitai · 2026-06-01T13:23:25Z

+                parse_result = outcome.parse_result
            else:


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard outcome.parse_result before calling _finalize_run.

Both paths assume parse_result exists when permission_request is absent, but _PaExecutorOutcome allows parse_result=None. Add an explicit invariant check to prevent a hidden None runtime path and keep typing strict.

Proposed fix

if outcome.permission_request is not None: return await self._park_for_permission( db, workspace_id, agent_run, outcome, start_time ) - parse_result = outcome.parse_result + if outcome.parse_result is None: + raise AgentProcessError( + "Executor outcome missing parse_result for completed run." + ) + parse_result = outcome.parse_result ... if outcome.permission_request is not None: return await self._park_for_permission( db, workspace_id, agent_run, outcome, start_time ) + if outcome.parse_result is None: + raise AgentProcessError( + "Executor outcome missing parse_result for completed resume." + ) return await self._finalize_run( db, workspace_id=workspace_id, agent_type=agent_run.agent_type, parse_result=outcome.parse_result,

As per coding guidelines, backend/**/*.py: Use Python 3.11+ with strict type hints and Pydantic models for backend code.

Also applies to: 1617-1622

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/cliff/agents/executor.py` around lines 833 - 834, The code calls _finalize_run assuming outcome.parse_result exists even though _PaExecutorOutcome.parse_result can be None; add an explicit guard/invariant before calling _finalize_run: check that outcome.parse_result is not None (raise an AssertionError or ValueError with a clear message if it is) and then pass the non-None parse_result to _finalize_run; apply the same explicit check in the other occurrence referenced (around the 1617–1622 block) to keep typing strict and avoid hidden None runtime paths, and update any local type narrowing so callers treat parse_result as non-Optional after the check.

Commit 9706d20 addressed this comment by adding an explicit guard that raises an AgentProcessError whenever the main tool-agent path returns a None parse_result, ensuring _finalize_run always receives a valid result after permission_request is absent.

Skipped: comment is from another GitHub bot.

coderabbitai · 2026-06-01T13:23:26Z

+    def _run() -> subprocess.CompletedProcess[str]:
+        return subprocess.run(
+            command,
+            shell=True,  # noqa: S602 — the remediation agent legitimately
+            # needs a shell (pipes, &&, redirects); the denylist gate above
+            # is the defense-in-depth layer, not shell=False.
+            cwd=ctx.deps.workspace_dir,
+            capture_output=True,
+            text=True,
+            timeout=_BASH_TIMEOUT_SECONDS,
+            env=ctx.deps.env_vars or None,
+        )
+
+    try:
+        proc = await asyncio.to_thread(_run)
+    except subprocess.TimeoutExpired:
+        return (
+            f"Command timed out after {_BASH_TIMEOUT_SECONDS:.0f}s and was "
+            "killed. If this was a long build, split it into smaller steps."
+        )


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="backend/cliff/agents/runtime/tools/bash.py" # Show the relevant portion with line numbers nl -ba "$FILE" | sed -n '1,220p' # Focus search for helpers/constants used in/around the snippet rg -n "_trim_output|_BASH_TIMEOUT_SECONDS|trim|denylist|subprocess\.run|asyncio\.to_thread" "$FILE" # Also locate any other references to this tool in the repo rg -n "tools/bash\.py|def bash\(|async def bash\(" backend | head -n 50

Repository: cliff-security/cliff

Length of output: 107

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="backend/cliff/agents/runtime/tools/bash.py" python3 - <<'PY' import itertools path = "backend/cliff/agents/runtime/tools/bash.py" with open(path, "r", encoding="utf-8") as f: lines = f.readlines() start, end = 40, 120 for i in range(start-1, min(end, len(lines))): print(f"{i+1:4d}: {lines[i].rstrip()}") PY rg -n "_trim_output|_BASH_TIMEOUT_SECONDS|trim|denylist|subprocess\.run|asyncio\.to_thread" "$FILE" # show full definitions around the match if any rg -n "def _run|async def bash|asyncio\.to_thread|shell=True" "$FILE" -n

Repository: cliff-security/cliff

Length of output: 3182

🌐 Web query:

Python subprocess.run shell=True timeout TimeoutExpired child processes continue after timeout capture_output buffers full stdout stderr in memory

💡 Result:

When using subprocess.run with shell=True, timeout, and capture_output=True, you may encounter several critical behaviors: 1. Orphaned Child Processes: If the command executed via shell=True spawns its own child processes, terminating the shell process (the direct child of Python) upon a TimeoutExpired exception does not automatically terminate those "grandchild" processes [1][2]. These descendants are orphaned and inherited by the init process (PID 1), continuing to run in the background [3][2]. 2. Memory Buffering and Deadlocks: When capture_output=True is used, subprocess.run internally uses subprocess.PIPE for stdout and stderr [4][5]. These pipes are managed via OS-level buffers [6]. If a subprocess generates enough output to fill these buffers (typically 64 KB on many systems) and the parent process stops reading from them—or is delayed during the timeout cleanup phase—the child process will block indefinitely, waiting for the buffer to clear [4][6]. This can result in the timeout mechanism failing to raise or clean up as expected [1][4]. 3. Timeout Limitations: The timeout argument in subprocess.run is passed internally to Popen.communicate [7][8][5]. While it attempts to kill the direct child process upon expiration, the complexity of shell=True (where the shell process may not propagate signals correctly to its own children) and the potential for pipe-related blocking can lead to scenarios where the Python call appears to hang or fails to fully reclaim resources [1][2]. To mitigate these issues: - Avoid shell=True whenever possible to maintain a direct process tree and better signal control [7][8]. - Use start_new_session=True in Popen to create a new process group, allowing you to kill the entire group (e.g., using os.killpg) upon a timeout [3]. - If you must use shell=True and require robust cleanup, consider using the Popen constructor directly to manually manage the communication and signal termination of the process group [3][2]. - For large volumes of output, avoid capturing all output in memory via PIPE. Instead, stream the output to files or handle it in a separate thread to prevent pipe buffer deadlocks [4][9][10].

Citations:

1: https://bugs.python.org/issue37424

2: https://stackoverflow.com/questions/74524193/how-ensure-subprocess-is-killed-on-timeout-when-using-run

3: https://alexandra-zaharia.github.io/posts/kill-subprocess-and-its-children-on-timeout-python/

4: https://docs.python.org/release/3.11.5/library/subprocess.html

5: https://github.com/python/cpython/blob/0274d830/Doc/library/subprocess.rst

6: https://www.reddit.com/r/Python/comments/1vbie0/subprocesspipe_will_hang_indefinitely_if_stdout/

7: https://docs.python.org/3/library/subprocess.html

8: https://docs.python.org/3.10/library/subprocess.html

9: https://lucadrf.dev/blog/python-subprocess-buffers/

10: https://dev.to/parvagarwal/how-i-stopped-losing-gpu-training-runs-during-long-experiments-3ham

🏁 Script executed:

#!/bin/bash set -euo pipefail python3 - <<'PY' path = "backend/cliff/agents/runtime/tools/bash.py" with open(path, "r", encoding="utf-8") as f: lines = f.readlines() for i in range(1, 55): if i <= len(lines): print(f"{i:4d}: {lines[i-1].rstrip()}") PY

Repository: cliff-security/cliff

Length of output: 2459

Use a managed process group + bounded streaming for timeouts and output limiting

In backend/cliff/agents/runtime/tools/bash.py, _run() uses subprocess.run(..., shell=True, timeout=...); on TimeoutExpired the code reports the command was “killed”, but subprocess.run primarily terminates the shell process—child/grandchild processes can still survive unless you kill the whole process group/session.

_trim_output() only trims after subprocess.run(..., capture_output=True) has fully buffered all stdout/stderr in memory, so a noisy command can still consume unbounded memory even though you later keep only the last 200 lines.

Implement via Popen/asyncio subprocess with start_new_session/setsid, stream output into a bounded buffer, and on timeout kill the entire process group.

🧰 Tools

🪛 OpenGrep (1.22.0)

[ERROR] 63-73: Dynamic command passed to subprocess with shell=True. Use a command list without shell=True, or use shlex.quote() to sanitize input.

(coderabbit.command-injection.python-shell-true)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/cliff/agents/runtime/tools/bash.py` around lines 62 - 81, The current _run() uses subprocess.run(...) and asyncio.to_thread which buffers all output and may leave child processes alive on timeout; replace this with an asyncio-based spawn using asyncio.create_subprocess_shell/create_subprocess_exec (or subprocess.Popen with start_new_session=True) so the process is started in a new session/process group, stream stdout/stderr into bounded deques (or byte/line ring buffers) to enforce output limits instead of capture_output=True, and enforce the timeout with asyncio.wait_for; on timeout kill the whole process group (os.killpg(process.pid, signal.SIGTERM) then SIGKILL if needed) and return the trimmed buffered output. Update references in this file: replace uses of _run, subprocess.run, and asyncio.to_thread with the new async streaming implementation, keep using ctx.deps.workspace_dir, ctx.deps.env_vars, and _BASH_TIMEOUT_SECONDS, and ensure _trim_output is replaced/rewired to use the bounded buffers you populate incrementally.

Deferring to a follow-up. Process-group kill on timeout + bounded stdout streaming is real hardening, but it's orthogonal to the PA migration — the same behavior existed pre-migration (this is a verbatim port of the existing bash execution). Keeping PR #2 scoped to the substrate swap; the CE single-tenant trust model (the operator's own machine) makes it non-urgent. Flagging for the backlog rather than expanding this PR.

@galanko, understood — the rationale is solid: this is a verbatim port, the CE trust model reduces urgency, and keeping #2 scoped to the substrate swap is the right call.

Would you like me to open a GitHub issue to track process-group kill on timeout + bounded stdout streaming for the bash tool? I can include a backlink to this PR and comment so it's easy to pick up later.

coderabbitai · 2026-06-01T13:23:26Z

+        command = cfg["command"]
+        if not command:
+            raise ValueError("local MCP config has empty command")
+        with _suppress_mcp_v2_deprecation():
+            return MCPServerStdio(
+                command=command[0],
+                args=list(command[1:]),
+                env=cfg.get("environment") or None,
+                tool_prefix=entry_id,
+            )
+
+    if server_type == "remote":
+        url = cfg["url"]
+        with _suppress_mcp_v2_deprecation():
+            return MCPServerStreamableHTTP(
+                url=url,
+                headers=cfg.get("headers") or None,
+                tool_prefix=entry_id,
+            )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate MCP config value types before constructing toolsets.

command and url are used without type checks. A malformed command string is currently treated as a char array and produces an invalid tool invocation instead of being skipped as malformed.

Proposed fix

def _build_one(entry_id: str, cfg: dict[str, Any]) -> AbstractToolset[Any] | None: server_type = cfg.get("type", "local") if server_type == "local": command = cfg["command"] - if not command: + if ( + not isinstance(command, (list, tuple)) + or not command + or not all(isinstance(part, str) and part for part in command) + ): raise ValueError("local MCP config has empty command") with _suppress_mcp_v2_deprecation(): return MCPServerStdio( command=command[0], args=list(command[1:]), env=cfg.get("environment") or None, tool_prefix=entry_id, ) if server_type == "remote": url = cfg["url"] + if not isinstance(url, str) or not url.strip(): + raise ValueError("remote MCP config has invalid url") with _suppress_mcp_v2_deprecation(): return MCPServerStreamableHTTP( url=url, headers=cfg.get("headers") or None, tool_prefix=entry_id, )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/cliff/agents/runtime/tools/mcp.py` around lines 85 - 103, Validate the MCP config types before constructing MCPServerStdio or MCPServerStreamableHTTP: ensure `command` is a non-empty list/tuple of strings (not a single string) before using command[0] and args, and ensure `url` is a string before constructing the remote server; if the values are malformed, raise a ValueError with a clear message (including `entry_id`) so callers can skip the malformed toolset instead of invoking an invalid tool.

Commit 9706d20 addressed this comment by adding explicit type and non-empty checks for the local MCP command config before using command[0], preventing malformed commands from being treated as char sequences even though remote URL validation remains to be added.

Skipped: comment is from another GitHub bot.

coderabbitai · 2026-06-01T13:23:26Z

+def _make_flaky_run(
+    error_attempts: int, *, status_code: int = 429, always_error: bool = False
+):


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Type the new helper callables explicitly.

The new flaky-run/backoff helpers should have explicit parameter and return types for strict backend typing compliance.

Suggested typing for new helpers

@@ from __future__ import annotations import json import time from pathlib import Path # noqa: TC003 — runtime use in fixture annotation +from collections.abc import Awaitable, Callable +from typing import Any @@ def _make_flaky_run( error_attempts: int, *, status_code: int = 429, always_error: bool = False -): +) -> tuple[Callable[[str, Any, Any], Awaitable[dict[str, Any]]], dict[str, int]]: @@ - async def _run(agent_type, deps, model): # noqa: ARG001 + async def _run(agent_type: str, deps: Any, model: Any) -> dict[str, Any]: # noqa: ARG001 @@ -def _zero_backoff(monkeypatch): +def _zero_backoff(monkeypatch: pytest.MonkeyPatch) -> None:

As per coding guidelines, backend/**/*.py: "Use Python 3.11+ with strict type hints and Pydantic models for backend code".

Also applies to: 102-103, 113-116

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/tests/integration/test_rate_limit_backoff.py` around lines 91 - 93, The helper factory _make_flaky_run should declare full type hints: annotate parameters (error_attempts: int, status_code: int = 429, always_error: bool = False) and specify a precise return type (e.g., Callable[..., Coroutine[Any, Any, ReturnType]] or Callable[[...], Awaitable[ResponseType]] depending on what the helper returns); update the inner helper callables referenced around lines 102-103 and 113-116 similarly so every helper parameter and return is explicitly typed (use typing.Callable, typing.Awaitable/Coroutine, and concrete response/result types from the test helpers or HTTP client types) to satisfy strict Python 3.11 backend typing rules.

Declining. This repo's ruff config doesn't enable the ANN ruleset (select = ["E","F","I","N","UP","B","SIM","TCH"]), and the existing test helpers/fixtures are untyped by convention. Annotating only the new helpers would be inconsistent with the surrounding suite. Enforced test typing would be a repo-wide convention change (enable ANN + backfill), which is a separate cleanup — out of scope for this migration PR.

Understood — the ruff ANN ruleset is not enabled, test helpers are intentionally left untyped, and a partial annotation would break consistency. I'll note this for future reviews.

Would you like me to resolve this comment?

(´• ω •`) 🐇

✏️ Learnings added

Learnt from: galanko Repo: cliff-security/cliff PR: 251 File: backend/tests/integration/test_rate_limit_backoff.py:91-93 Timestamp: 2026-06-01T14:30:39.014Z Learning: In the `cliff-security/cliff` repository, the ruff linter configuration uses `select = ["E","F","I","N","UP","B","SIM","TCH"]` — the `ANN` (annotation) ruleset is intentionally excluded. Test helpers and fixtures are untyped by convention; adding type annotations only to individual new test helpers is considered inconsistent with the surrounding suite. Enforcing test-level type annotations would be a repo-wide cleanup requiring `ANN` enablement plus a full backfill — do not flag missing type hints on test helper functions in this repo.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

coderabbitai · 2026-06-01T13:23:26Z

+def _permission_route_patches(*, pending: bool):
+    """Patch the route's get_agent_run / get_workspace / env-resolver.

-        resp = await client.post(
-            "/api/workspaces/ws-1/agent-runs/run-1/permission",
-            json={"approved": True},
-        )
-
-        assert resp.status_code == 200
-        body = resp.json()
-        assert body["status"] == "approved"
-        assert body["agent_run_id"] == "run-1"
-
-    @pytest.mark.asyncio
-    async def test_deny_calls_executor_deny_tool(self, app, client):
-        executor = app.state.agent_executor
-        executor.deny_tool = lambda run_id: True
-
-        resp = await client.post(
-            "/api/workspaces/ws-1/agent-runs/run-1/permission",
-            json={"approved": False},
-        )
-
-        assert resp.status_code == 200
-        body = resp.json()
-        assert body["status"] == "denied"
-        assert body["agent_run_id"] == "run-1"
-
-    @pytest.mark.asyncio
-    async def test_no_pending_returns_404(self, app, client):
-        executor = app.state.agent_executor
-        executor.approve_tool = lambda run_id: False
-        executor.deny_tool = lambda run_id: False
-
-        resp = await client.post(
-            "/api/workspaces/ws-1/agent-runs/gone/permission",
-            json={"approved": True},
-        )
-
-        assert resp.status_code == 404
-        assert "No pending permission request" in resp.json()["detail"]
-
-    @pytest.mark.asyncio
-    async def test_approve_routed_to_approve_not_deny(self, app, client):
-        """Trust guard — make sure ``approved=true`` doesn't accidentally
-        wire to deny_tool. Regression catcher for the conditional in
-        ``respond_to_permission``."""
-        approve_calls = []
-        deny_calls = []
-
-        executor = app.state.agent_executor
-        executor.approve_tool = lambda run_id: approve_calls.append(run_id) or True
-        executor.deny_tool = lambda run_id: deny_calls.append(run_id) or True
-
-        resp = await client.post(
-            "/api/workspaces/ws-1/agent-runs/run-77/permission",
-            json={"approved": True},
-        )
+    ``pending`` controls whether the agent_run reports a pending permission
+    request (the 404 gate keys off it).
+    """


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add strict type hints to the newly introduced backend test helpers.

The new helper and async test signatures should be explicitly typed to meet the backend strict-typing requirement.

Type-hint pass for new signatures

@@ -def _permission_route_patches(*, pending: bool): +def _permission_route_patches(*, pending: bool) -> tuple[Any, Any, Any]: @@ - async def test_approve_resumes_executor(self, app, client): + async def test_approve_resumes_executor( + self, app: FastAPI, client: AsyncClient + ) -> None: @@ - async def _resume(db, ws, rid, *, approved, workspace_dir, - deny_message=None, env_vars=None): + async def _resume( + db: Any, + ws: Any, + rid: str, + *, + approved: bool, + workspace_dir: str, + deny_message: str | None = None, + env_vars: dict[str, str] | None = None, + ) -> None: @@ - async def test_deny_resumes_executor_with_denied(self, app, client): + async def test_deny_resumes_executor_with_denied( + self, app: FastAPI, client: AsyncClient + ) -> None: @@ - async def test_no_pending_returns_404(self, app, client): + async def test_no_pending_returns_404( + self, app: FastAPI, client: AsyncClient + ) -> None:

As per coding guidelines, backend/**/*.py: "Use Python 3.11+ with strict type hints and Pydantic models for backend code".

Also applies to: 746-752, 775-780, 801-801

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/tests/test_routes_agent_execution.py` around lines 709 - 714, Annotate the new helper and the async test signatures with explicit type hints: change def _permission_route_patches(*, pending: bool) to include a return type (for example def _permission_route_patches(*, pending: bool) -> Iterator[None] and import Iterator from typing, or use ContextManager[Any] if it yields a context manager), and ensure every async test function in this file has an explicit return annotation (e.g. async def test_xyz(...) -> None). Also add missing parameter type annotations for any other params referenced in those helpers (e.g. any mocked objects) so all new helper and test signatures are fully typed per backend strict-typing rules.

Declining. This repo's ruff config doesn't enable the ANN ruleset (select = ["E","F","I","N","UP","B","SIM","TCH"]), and the existing test helpers/fixtures are untyped by convention. Annotating only the new helpers would be inconsistent with the surrounding suite. Enforced test typing would be a repo-wide convention change (enable ANN + backfill), which is a separate cleanup — out of scope for this migration PR.

@galanko, understood — that's a fair call. Ruff not enabling ANN makes partial annotation noise rather than signal, and consistency with the existing untyped test suite is the right call here.

(°–°) 🐇

✏️ Learnings added

Learnt from: galanko Repo: cliff-security/cliff PR: 251 File: backend/tests/test_routes_agent_execution.py:0-0 Timestamp: 2026-06-01T14:30:37.773Z Learning: In the cliff-security/cliff repository, test helpers and pytest fixtures are intentionally untyped by convention. The ruff config (backend/pyproject.toml) uses `select = ["E","F","I","N","UP","B","SIM","TCH"]` and does not enable the `ANN` ruleset. Suggesting type annotations on new test helpers/fixtures should be avoided, as it would be inconsistent with the surrounding suite. Enabling `ANN` and backfilling annotations is a separate repo-wide cleanup effort, out of scope for individual feature PRs.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

galanko · 2026-06-01T13:35:47Z

baz review

baz-reviewer · 2026-06-01T13:42:42Z

+def classify_tool_request(tool: str, patterns: list[str]) -> str:
+    """Return ``"auto"``, ``"ask"``, or ``"deny"`` for an executor tool call.
+
+    Ported verbatim from ``executor._classify_tool_request``.
+


classify_tool_request duplicates executor._classify_tool_request, so should we keep a single helper and have executor/tests import it to avoid drift when the deny-list or ask paths change?

Want Baz to fix this for you? Activate Fixer

Commit 9706d20 addressed this comment by removing the executor-specific _classify_tool_request and now keeping the deny/ask/denylist classifier only in backend/cliff/agents/runtime/tools/permissions.py, which is used by the new gate_tool_call enforcement. The old OpenCode permission queue path (and its duplicated classifier) was replaced with the Pydantic AI executor path, so there is no longer a second copy to drift apart.

Intentional and temporary during the migration — called out in the permissions.py module docstring. The OpenCode-era executor._classify_tool_request is deleted in PR #3 of this stack (chore/drop-opencode-substrate), at which point permissions.classify_tool_request is the single source. Both copies are byte-identical and covered by the same tests until then, so there's no drift window in practice. Consolidating now would mean importing the new module back into code that's about to be deleted.

baz-reviewer · 2026-06-01T13:42:42Z

+    approvals = reqs.approvals
+    tool_call_ids = [p.tool_call_id for p in approvals]
+    primary = approvals[0]
+    meta = reqs.metadata.get(primary.tool_call_id, {}) if reqs.metadata else {}


DeferredToolRequests.approvals can be empty, so approvals[0] may raise IndexError and skip _park_for_permission(); should we handle the empty-list case and build the marker from whatever tool-call data is present instead of indexing the first approval?

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In backend/cliff/agents/executor.py around lines 651-663, in the `_build_permission_marker(reqs: DeferredToolRequests)` logic, stop unconditionally doing `primary = approvals[0]` and handling `approvals` as if it is always non-empty. Refactor `_build_permission_marker` to explicitly handle the `approvals == []` case by building a fallback `permission_request` marker without indexing (prefer inferring `tool_call_ids` from `reqs.metadata` when available; otherwise use an empty `tool_call_ids` list and safe defaults like `tool="unknown"` and `patterns=[]`). This ensures `_run_pa_executor` can return `_PaExecutorOutcome(permission_request=...)` and `_park_for_permission()` persists `permission_pending`/`pa_message_history`, so the run ends up in `awaiting_permission` rather than failing with `IndexError`.

Commit 9706d20 addressed this comment by refactoring _build_permission_marker to guard when DeferredToolRequests.approvals is empty, avoiding approvals[0] and instead returning a safe fallback marker (with default tool, patterns, and inferred tool_call_ids) so the permission flow can pause without crashing.

Fixed in 9706d20. _build_permission_marker no longer indexes unconditionally — it guards the empty-approvals case and builds a fallback marker (tool='unknown', patterns=[], tool_call_ids=[]) so an approvals-empty DeferredToolRequests parks cleanly instead of raising IndexError. In practice Cliff's gated tools always raise ApprovalRequired (→ non-empty approvals); the guard covers PA's external-calls mechanism, which Cliff doesn't use. Covered by two new unit tests in test_deferred_tools_persist.py.

Thanks for the update—guarding _build_permission_marker when there are no approvals and adding the fallback marker keeps the pause/resume flow safe even for PA’s external pause API.

baz-reviewer · 2026-06-01T13:42:42Z

    run_id: str,
    body: PermissionDecision,
    request: Request,
+    db=Depends(get_db),
 ):
-    """Approve or deny a pending tool-use permission request."""
+    """Approve or deny a paused executor tool call, resuming the run.
+
+    ADR-0047 / PR #2 — the remediation_executor pauses on a gated tool by
+    persisting a ``DeferredToolRequests`` marker (``agent_run.permission_
+    request`` + ``pa_message_history``). This endpoint resumes the run via
+    ``executor.resume_executor`` with the user's decision. Resume actually
+    re-runs the agent (it can take minutes), so it runs as a background


Can we open a fresh DB session inside _resume_in_background instead of closing over db=Depends(get_db), otherwise the request-scoped connection is gone by the time executor.resume_executor writes and the background task can hit ProgrammingError/DatabaseError?

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In backend/cliff/api/routes/agent_execution.py around lines 515-561, in the `permission_decision` POST handler, you create an asyncio background task (`_resume_in_background`) that closes over the request-scoped `db` dependency (`db=Depends(get_db)`). Refactor so the background task does NOT use the request’s `db` session; instead, open a fresh DB session inside `_resume_in_background` (e.g., call `get_db` again / use a session factory) and ensure it’s closed after `executor.resume_executor` completes. This guarantees the resume/update logic runs with a valid connection even after the HTTP response returns.

Not applicable to Cliff's connection model. get_db yields a single process-global aiosqlite connection opened once at init_db() and closed only at shutdown (close_db()) — there's no per-request session that gets torn down when the response returns (see cliff/db/connection.py: the dependency just yield _dbs the shared singleton with no teardown). So the closure in _resume_in_background stays valid for the life of the process. This matches the existing background-task pattern in execute/run-all, which also close over db.

Thanks, that makes sense given Cliff’s process-global aiosqlite connection model. I’ll save this to memory once the PR is merged.

baz-reviewer · 2026-06-01T13:42:42Z

+    run = await get_agent_run(db, run_id)
+    if run is None or not run.permission_pending:
        raise HTTPException(
            status_code=404,
            detail="No pending permission request for this agent run",
        )
+    workspace = await get_workspace(db, workspace_id)
+    if workspace is None or not workspace.workspace_dir:
+        raise HTTPException(status_code=404, detail="Workspace not found")


respond_to_permission should verify run.workspace_id == workspace_id before executor.resume_executor, otherwise a paused run can resume in the wrong workspace context — should we add the same ownership guard as cancel_agent_run?

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In backend/cliff/api/routes/agent_execution.py around lines 532-540 inside the `respond_to_permission` endpoint, add a workspace/run ownership guard: after `run = await get_agent_run(db, run_id)` and `workspace = await get_workspace(db, workspace_id)`, verify that `run.workspace_id == workspace_id` (or the equivalent field on the run model). If they don’t match, raise an HTTPException (404 is consistent with the other “not found pending request” behavior) and do not schedule `_resume_in_background`. This prevents a client from resuming a paused run from a different workspace with the wrong `workspace_dir`/env vars.

Commit 9706d20 addressed this comment by adding a workspace ownership guard in respond_to_permission, ensuring run.workspace_id == workspace_id before resuming the executor and returning 404 when the run belongs to another workspace.

baz-reviewer · 2026-06-01T13:42:42Z

+                deny_message=body.deny_message,
+                env_vars=env_vars,
+            )
        except Exception:
            logger.exception(


GET /workspaces/{workspace_id}/agent-execution/stream looks like the only SSE path for agent_run_started/completed, permission_request, and done; should we keep it behind a deprecation window or document the agent-runs replacement in the public API/changelog for external EventSource consumers?

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In `backend/cliff/api/routes/agent_execution.py` around lines 552-648 where the `stream_agent_execution` GET SSE endpoint (`/workspaces/{workspace_id}/agent-execution/stream`) was removed, re-add a compatibility shim instead of dropping the only channel that emits `agent_run_started`, `agent_run_completed`, `permission_request`, and `done`. Implement the endpoint again (or keep it behind a short deprecation window/feature flag) so existing EventSource clients still receive the same event shapes, and if it will be removed later, return a clear `Deprecation`/`Sunset`-style response header or an explicit message pointing to the polling replacement (`agent-runs`). Also add a public migration note in the API changelog/docs mentioning the deprecation window and the replacement path for external SSE consumers.

Won't-fix for the community edition. Cliff CE is single-user and self-hosted — the React frontend is the sole consumer of this endpoint, and it's switched to agent-runs polling in this same PR (PR2.D). There are no external EventSource clients to keep a deprecation window for, so the channel is removed rather than sunset. The SSE removal + polling replacement will be noted in the release CHANGELOG. A hardened multi-tenant deployment with third-party SSE consumers would warrant a deprecation window, but that's out of scope here.

Thanks, that makes sense for CE. I'll save this to memory once the PR is merged.

Security / correctness: - read tool: refuse paths resolving outside the workspace (was reading arbitrary host files); add shared escapes_workspace() helper in permissions.py, used by both read and edit. - edit tool: fix relative_to() crash on an approved escaping write — fall back to absolute path in the confirmation message. - webfetch: stream the response, enforce the content-type allowlist from headers before reading any body, and stop at _MAX_BODY_BYTES so a huge or binary body can't blow memory/context. - agent-execution permission route: add workspace-ownership check so a run parked in another workspace can't be resumed via this workspace's URL (and with its env). - mcp adapter: reject a non-list `command` (a string would be split into characters) with a clear TypeError; skipped, not mangled. - executor: guard the typed-Optional parse_result so a contract violation surfaces as a failed run, not an AttributeError in _finalize_run. Tests: - edit: split the misnamed "write outside" test into a real approved-escaping-write case (exercises the relative_to fallback) and an inside-write case. - read: refuse-escape test (symlink) asserting host contents don't leak. - mcp: string-command-skipped test. - webfetch: oversize-body truncation test. - permission route: replace sleep(0.02) with a deterministic asyncio.Event wait; add cross-workspace-ownership 404 test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

baz-reviewer flagged that `approvals[0]` would IndexError if a DeferredToolRequests arrives with an empty `approvals` list. Cliff's gated tools always raise ApprovalRequired (→ non-empty approvals), but PA also models external deferred `calls` (a mechanism Cliff doesn't use); guard the empty case so the run parks cleanly with a fallback marker instead of crashing out of the dispatch path. Tests: marker built from a real approval; marker from an empty DeferredToolRequests returns safe defaults without indexing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

galanko · 2026-06-01T14:30:44Z

Review round addressed — `9706d20`

Thanks both bots. Summary of how each finding was handled.

Fixed (commits `f546349` + `9706d20`)

Finding	Resolution
`read` reads arbitrary host files	Refuse paths resolving outside the workspace; shared `escapes_workspace()` helper
`edit` `relative_to()` crash on approved escaping write	Fall back to absolute path in the confirmation message
`webfetch` buffers full body before checks	Stream; enforce content-type from headers first; stop at the byte cap
permission route missing ownership check	`run.workspace_id != workspace_id` → 404
`mcp` `command` not validated	Reject non-list `command` with a clear `TypeError`; `Iterator[None]` annotation
`executor` unguarded `parse_result`	Guard the typed Optional → failed run, not `AttributeError`
`_build_permission_marker` `approvals[0]` IndexError	Guard empty-`approvals`; fallback marker
edit "approved-outside" test didn't exercise the path	Split into real escaping-write + inside-write cases
sleep-based test sync	Deterministic `asyncio.Event`

New tests: read refuse-escape (symlink, no host-content leak), mcp string-command-skipped, webfetch oversize-truncation, cross-workspace-ownership 404, _build_permission_marker normal + empty. 93 affected tests pass, ruff clean.

Declined / deferred (rationale on each thread)

SSE deprecation window — CE is single-user self-hosted; the frontend is the sole consumer and was migrated to agent-runs polling in this PR. No external EventSource clients to sunset for. Noted for the CHANGELOG.
Fresh DB session in the background task — get_db yields a process-global singleton connection (closed only at shutdown), so the closure stays valid; matches the existing execute/run-all pattern.
awaiting_permission in AgentRunStatus — never written to the DB as a status; it's only the in-memory AgentExecutionResult.status. The parked row stays running + permission_pending.
Duplicate classifier in executor.py — intentional during the migration; the OpenCode copy is deleted in PR Switch license from MIT to AGPL-3.0 for open-core model #3 of this stack. Documented in the permissions.py docstring.
bash process-group kill + bounded streaming — real hardening but orthogonal to the migration (verbatim port of existing behavior); backlog follow-up.
Strict type hints on test helpers — ANN isn't in the repo's ruff select; test helpers are untyped by convention. A repo-wide typing change is separate cleanup.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

backend/tests/agents/tools/test_tools.py (1)

20-27: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider adding type hints to the _ctx helper.

The workspace_dir and env_vars parameters lack type annotations. While test helpers often have relaxed typing, adding hints improves maintainability.

-def _ctx(workspace_dir, *, approved: bool = False, env_vars=None) -> SimpleNamespace:
+def _ctx(
+    workspace_dir: str | Path,
+    *,
+    approved: bool = False,
+    env_vars: dict[str, str] | None = None,
+) -> SimpleNamespace:

As per coding guidelines, "Use Python 3.11+ with strict type hints and Pydantic models for backend code".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/tests/agents/tools/test_tools.py` around lines 20 - 27, Add Python
type hints to the _ctx helper: annotate workspace_dir as pathlib.Path | str (or
simply Path) and env_vars as dict[str, str] | None, and annotate the return type
as SimpleNamespace. Update the function signature for _ctx and add any necessary
imports (e.g., from pathlib import Path and from typing import Optional, Dict)
so the signature reads something like _ctx(workspace_dir: Path | str, *,
approved: bool = False, env_vars: Optional[Dict[str, str]] = None) ->
SimpleNamespace, keeping the existing body and behavior unchanged and
referencing the same WorkspaceDeps construction.

backend/cliff/agents/runtime/tools/mcp.py (1)

103-110: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate url type before constructing MCPServerStreamableHTTP.

The past review flagged that url should be validated as a non-empty string, but that check was not added. A malformed url (e.g., None, empty string, or wrong type) will produce an invalid remote toolset or raise downstream.

🛡️ Proposed fix to add url validation

     if server_type == "remote":
         url = cfg["url"]
+        if not isinstance(url, str) or not url.strip():
+            raise ValueError(
+                f"remote MCP config 'url' must be a non-empty string, got "
+                f"{type(url).__name__}"
+            )
         with _suppress_mcp_v2_deprecation():

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/cliff/agents/runtime/tools/mcp.py` around lines 103 - 110, When
server_type == "remote" validate cfg["url"] is a non-empty string before calling
MCPServerStreamableHTTP: retrieve url = cfg["url"], check isinstance(url, str)
and url.strip() is not empty, and if the check fails raise a clear ValueError
(or log and raise) indicating an invalid remote url for entry_id; only then call
_suppress_mcp_v2_deprecation() and construct MCPServerStreamableHTTP(url=url,
headers=cfg.get("headers") or None, tool_prefix=entry_id).

♻️ Duplicate comments (3)

backend/tests/test_routes_agent_execution.py (2)

746-848: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add strict return type annotations to async test methods.

The async test methods in TestPermissionEndpoint lack explicit -> None return annotations. As per coding guidelines, backend code must use Python 3.11+ with strict type hints.

♻️ Proposed fix to add return annotations

     `@pytest.mark.asyncio`
-    async def test_approve_resumes_executor(self, app, client):
+    async def test_approve_resumes_executor(
+        self, app: FastAPI, client: AsyncClient
+    ) -> None:

     `@pytest.mark.asyncio`
-    async def test_deny_resumes_executor_with_denied(self, app, client):
+    async def test_deny_resumes_executor_with_denied(
+        self, app: FastAPI, client: AsyncClient
+    ) -> None:

     `@pytest.mark.asyncio`
-    async def test_no_pending_returns_404(self, app, client):
+    async def test_no_pending_returns_404(
+        self, app: FastAPI, client: AsyncClient
+    ) -> None:

     `@pytest.mark.asyncio`
-    async def test_run_owned_by_other_workspace_returns_404(self, app, client):
+    async def test_run_owned_by_other_workspace_returns_404(
+        self, app: FastAPI, client: AsyncClient
+    ) -> None:

As per coding guidelines, backend/**/*.py: Use Python 3.11+ with strict type hints and Pydantic models for backend code.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/tests/test_routes_agent_execution.py` around lines 746 - 848, The
async test methods in TestPermissionEndpoint (e.g.,
test_approve_resumes_executor, test_deny_resumes_executor_with_denied,
test_no_pending_returns_404, test_run_owned_by_other_workspace_returns_404) are
missing explicit return type annotations; update each async def to include ->
None (e.g., async def test_approve_resumes_executor(self, app, client) -> None:)
to satisfy the strict Python 3.11+ typing guideline throughout the test class.

709-743: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add strict type hints to the permission route helper.

The helper _permission_route_patches lacks an explicit return type annotation. As per coding guidelines, backend code must use Python 3.11+ with strict type hints.

♻️ Proposed fix to add return type

+from typing import Iterator
+from unittest.mock import _patch

-def _permission_route_patches(*, pending: bool, run_workspace_id: str = "ws-1"):
+def _permission_route_patches(
+    *, pending: bool, run_workspace_id: str = "ws-1"
+) -> tuple[_patch[Any], _patch[Any], _patch[Any]]:

As per coding guidelines, backend/**/*.py: Use Python 3.11+ with strict type hints and Pydantic models for backend code.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/tests/test_routes_agent_execution.py` around lines 709 - 743, The
helper _permission_route_patches has no return type annotation; add a strict
type hint on its signature like def _permission_route_patches(*, pending: bool,
run_workspace_id: str = "ws-1") -> tuple[object, object, object]: to explicitly
declare it returns a 3-tuple of patch objects (or replace object with a more
specific mock type if you prefer), and keep the body unchanged; this satisfies
the backend strict typing rule while preserving references to AgentRun,
SimpleNamespace, AsyncMock and patch used inside the function.

backend/cliff/agents/executor.py (1)

1636-1649: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard outcome.parse_result before calling _finalize_run in resume path.

The resume_executor path calls _finalize_run with outcome.parse_result (line 1644) without verifying it is not None. The _PaExecutorOutcome contract allows parse_result=None, and a contract violation would crash _finalize_run with an AttributeError instead of surfacing as a failed run. The execute() path added this guard (lines 851-855), but the resume path is still missing it.

🐛 Proposed fix to add guard

             if outcome.permission_request is not None:
                 return await self._park_for_permission(
                     db, workspace_id, agent_run, outcome, start_time
                 )
+            if outcome.parse_result is None:
+                raise AgentProcessError(
+                    "executor outcome missing parse_result for completed resume"
+                )
             return await self._finalize_run(
                 db,
                 workspace_id=workspace_id,
                 agent_type=agent_run.agent_type,
                 parse_result=outcome.parse_result,

As per coding guidelines, backend/**/*.py: Use Python 3.11+ with strict type hints and Pydantic models for backend code.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/cliff/agents/executor.py` around lines 1636 - 1649, The resume path
calls _finalize_run with outcome.parse_result without checking for None; update
the resume_executor flow so that before calling _finalize_run (the branch that
currently mirrors lines 1636-1649), you guard outcome.parse_result the same way
as in execute()—if outcome.parse_result is None, treat the run as failed/handle
appropriately (e.g., return a failed run result or raise a controlled error)
instead of passing None into _finalize_run; reference outcome.parse_result,
_finalize_run, and the resume path that currently calls _park_for_permission to
locate where to add the guard.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@backend/cliff/agents/runtime/tools/mcp.py`:
- Around line 103-110: When server_type == "remote" validate cfg["url"] is a
non-empty string before calling MCPServerStreamableHTTP: retrieve url =
cfg["url"], check isinstance(url, str) and url.strip() is not empty, and if the
check fails raise a clear ValueError (or log and raise) indicating an invalid
remote url for entry_id; only then call _suppress_mcp_v2_deprecation() and
construct MCPServerStreamableHTTP(url=url, headers=cfg.get("headers") or None,
tool_prefix=entry_id).

In `@backend/tests/agents/tools/test_tools.py`:
- Around line 20-27: Add Python type hints to the _ctx helper: annotate
workspace_dir as pathlib.Path | str (or simply Path) and env_vars as dict[str,
str] | None, and annotate the return type as SimpleNamespace. Update the
function signature for _ctx and add any necessary imports (e.g., from pathlib
import Path and from typing import Optional, Dict) so the signature reads
something like _ctx(workspace_dir: Path | str, *, approved: bool = False,
env_vars: Optional[Dict[str, str]] = None) -> SimpleNamespace, keeping the
existing body and behavior unchanged and referencing the same WorkspaceDeps
construction.

---

Duplicate comments:
In `@backend/cliff/agents/executor.py`:
- Around line 1636-1649: The resume path calls _finalize_run with
outcome.parse_result without checking for None; update the resume_executor flow
so that before calling _finalize_run (the branch that currently mirrors lines
1636-1649), you guard outcome.parse_result the same way as in execute()—if
outcome.parse_result is None, treat the run as failed/handle appropriately
(e.g., return a failed run result or raise a controlled error) instead of
passing None into _finalize_run; reference outcome.parse_result, _finalize_run,
and the resume path that currently calls _park_for_permission to locate where to
add the guard.

In `@backend/tests/test_routes_agent_execution.py`:
- Around line 746-848: The async test methods in TestPermissionEndpoint (e.g.,
test_approve_resumes_executor, test_deny_resumes_executor_with_denied,
test_no_pending_returns_404, test_run_owned_by_other_workspace_returns_404) are
missing explicit return type annotations; update each async def to include ->
None (e.g., async def test_approve_resumes_executor(self, app, client) -> None:)
to satisfy the strict Python 3.11+ typing guideline throughout the test class.
- Around line 709-743: The helper _permission_route_patches has no return type
annotation; add a strict type hint on its signature like def
_permission_route_patches(*, pending: bool, run_workspace_id: str = "ws-1") ->
tuple[object, object, object]: to explicitly declare it returns a 3-tuple of
patch objects (or replace object with a more specific mock type if you prefer),
and keep the body unchanged; this satisfies the backend strict typing rule while
preserving references to AgentRun, SimpleNamespace, AsyncMock and patch used
inside the function.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: f50b62be-69c0-463d-bca7-116d1d15ff26

📥 Commits

Reviewing files that changed from the base of the PR and between c25f096 and 9706d20.

📒 Files selected for processing (11)

backend/cliff/agents/executor.py
backend/cliff/agents/runtime/tools/edit.py
backend/cliff/agents/runtime/tools/mcp.py
backend/cliff/agents/runtime/tools/permissions.py
backend/cliff/agents/runtime/tools/read.py
backend/cliff/agents/runtime/tools/webfetch.py
backend/cliff/api/routes/agent_execution.py
backend/tests/agents/test_deferred_tools_persist.py
backend/tests/agents/tools/test_mcp.py
backend/tests/agents/tools/test_tools.py
backend/tests/test_routes_agent_execution.py

galanko and others added 6 commits May 31, 2026 23:10

Merge branch 'main' into feat/pa-migrate-executor

c25f096

coderabbitai Bot reviewed Jun 1, 2026

View reviewed changes

baz-reviewer Bot reviewed Jun 1, 2026

View reviewed changes

galanko and others added 2 commits June 1, 2026 17:24

coderabbitai Bot reviewed Jun 1, 2026

View reviewed changes

galanko merged commit eb1a95e into main Jun 1, 2026
8 checks passed

galanko deleted the feat/pa-migrate-executor branch June 1, 2026 16:27

Conversation

galanko commented Jun 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

What changed (by slice)

What did NOT change

Deviations from IMPL-0022 (verified against PA runtime)

Test plan

Skipped tests (un-skipped/removed in PR #3)

Stacked PRs

Summary by CodeRabbit

Generated description

Uh oh!

coderabbitai Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

galanko Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

galanko Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

galanko Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

galanko commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

coderabbitai Bot Jun 1, 2026 •

edited

Loading

coderabbitai Bot Jun 1, 2026 •

edited

Loading

coderabbitai Bot Jun 1, 2026 •

edited

Loading

coderabbitai Bot Jun 1, 2026 •

edited

Loading

coderabbitai Bot Jun 1, 2026 •

edited

Loading

Review round addressed — `9706d20`

Fixed (commits `f546349` + `9706d20`)