Skip to content

feat(agents): migrate the remediation_executor to Pydantic AI (PR #2 of 3)#251

Merged
galanko merged 9 commits into
mainfrom
feat/pa-migrate-executor
Jun 1, 2026
Merged

feat(agents): migrate the remediation_executor to Pydantic AI (PR #2 of 3)#251
galanko merged 9 commits into
mainfrom
feat/pa-migrate-executor

Conversation

@galanko
Copy link
Copy Markdown
Collaborator

@galanko galanko commented Jun 1, 2026

User description

PR #2 of three executing IMPL-0022 against ADR-0047. Moves the remediation_executor — the one tool-using agent — off the OpenCode subprocess and onto in-process Pydantic AI, with a durable human-in-the-loop approval cycle. PR #1 (#221 + #249) migrated the six no-tools agents; PR #3 deletes the leftover OpenCode engine/templates + the app-level normalizer.

Net: +2,718 / −2,127 — a ~2,400-line deletion of subprocess / SSE / permission-queue machinery, replaced by ~900 lines of in-process tools + a DB-marker park/resume cycle. Exactly ADR-0047's "simpler" thesis.

What changed (by slice)

Slice Detail
A — tool primitives backend/cliff/agents/runtime/tools/bash, edit, read, webfetch, gh as in-process @agent.tool functions, plus permissions.py (the auto/ask/deny tier classifier ported verbatim from the OpenCode-era _classify_tool_request + denylists). gate_tool_call maps a tier to PA's HITL vocabulary: denyModelRetry (model pivots), askApprovalRequired until approved, auto→run.
B — executor agent runtime/remediation_executor.pyAgent(output_type=[RemediationExecutorOutput, DeferredToolRequests], tools=…, toolsets=<mcp>). SYSTEM_PROMPT lifted from the Jinja template (all seven hard rules preserved verbatim — each = a real shipped regression). runtime/tools/mcp.py adapts Cliff's resolved MCP configs to PA toolsets.
C — dispatch + HITL executor.execute() routes remediation_executor through _run_pa_executor. A gated tool → DeferredToolRequests output → migration 023 persists permission_request + pa_message_history on the agent_run row, returns the new awaiting_permission status. POST .../permission resumes via resume_executor (rebuild agent → deserialize history → ToolApproved/ToolDenied → re-enter or re-park). Shared _finalize_run / _resolve_active_model / _run_pa_call extracted so initial-run and resume reach identical persistence + retry.
D — delete frontend SSE The /agent-execution/stream route + the IssueSidePanel EventSource were only ever a refetch nudge over the polled agent-runs query (which is the source of truth and renders the prompt from the durable marker). Both deleted; the frontend relies on the 3s/5s poll.
E — delete OpenCode machinery Removed _send_and_collect, _collect_response, approve_tool/deny_tool/push_permission_event/ensure_permission_queue/_patch_permission_marker, _PendingApproval, the dead string-based rate-limit classifier, and the in-memory permission queue. _active_runs/check_not_busy stay (run-tracking).

What did NOT change

Deviations from IMPL-0022 (verified against PA runtime)

  • denyModelRetry, not ValueError. PA propagates a raw tool ValueError (crashes the run); ModelRetry is what PA converts to a model-visible retry — matching ADR-0047's stated "the model gets a deterministic error and pivots." (IMPL text wants a one-line update.)
  • Frontend SSE deleted, not retargeted. The IMPL's conservative "keep the wire, swap the producer" was over-built — the channel carried no data post-PA. Deleting it is the bigger, cleaner win.

Test plan

  • uv run ruff check cliff/ tests/ — clean
  • Full backend non-e2e suite — 1570 passed / 14 skipped (the only failure is the pre-existing wiz-s3-public LLM-eval flake, untouched by this branch)
  • New tests: test_deferred_tools_persist.py (real FunctionModel park→resume + corrupt-state→failed), test_remediation_executor.py (defer/deny/complete), test_executor_provider_chain.py (6 providers), tools/ (44), rate-limit retargeted to the PA _run_pa_call path
  • OpenAPI snapshot regenerated (drops the stream path, adds optional deny_message)
  • Live smoke on the installed build — migration 023 applied, PA executor path live, OpenCode/SSE machinery gone; cliffsec selftest ran the full no-tools pipeline in-process to the plan gate, no regression. Transcript: cliff-os/docs/qa/migration-smoke/PR2-executor/
  • /code-review (recall, 7 angles) — one confirmed reliability gap fixed (resume_executor no longer wedges a run on a corrupt pa_message_history); two findings refuted (DB lifetime, edit-approve semantics); cleanup deferred to PR Switch license from MIT to AGPL-3.0 for open-core model #3
  • Live HITL smoke docs: restructure roadmap into parallel stages #4 (rm -rf approval) + feat: Phase 3 — Persistence & Domain Model #5 (validator retry) — blocked on a GitHub re-auth (expired token → executor can't open a PR, independent of the migration). Park/resume is unit-covered; best finished interactively post-merge.

Skipped tests (un-skipped/removed in PR #3)

13 OpenCode-machinery tests carry explicit PR2.E skip reasons (same staged pattern PR #1 used). Their behaviour is replaced by test_deferred_tools_persist.py + the route tests; the dead machinery they exercise is deleted in PR #3.

Stacked PRs

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Agent runs now pause with an "awaiting_permission" state for gated tool calls and can be resumed via background/polled approval endpoints.
    • New in-process remediation tools (bash, edit, read, webfetch, gh) with safety gating and optional MCP toolset support.
    • Executor persists message history so paused runs can be resumed.
  • Bug Fixes

    • Removed reliance on SSE streams for permission flow; improves reliability and cleanup.
  • Tests

    • Added coverage for deferred-tool park/resume, tools, permissions, and rate-limit backoff.

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Migrate the remediation_executor path in AgentExecutor onto the remediation_executor Pydantic AI agent so tool-using runs execute in-process with deferred-tool outputs and HITL resume markers. Update the POST /permission resume endpoint, pipeline handling, migration 023, and frontend polling so permission decisions rely on the durable agent_run marker instead of the deleted SSE/subprocess plumbing.

TopicDetails
Executor PA Flow Orchestrate AgentExecutor to run the remediation_executor via _run_pa_executor/resume_executor, persist DeferredToolRequests markers and message history, and finalize the paused result through the POST /permission flow while the pipeline defers on awaiting_permission instead of failing.
Modified files (12)
  • backend/cliff/agents/executor.py
  • backend/cliff/agents/pipeline.py
  • backend/cliff/api/routes/agent_execution.py
  • backend/cliff/db/migrations/023_agent_run_pa_message_history.sql
  • backend/cliff/db/repo_agent_run.py
  • backend/tests/agents/test_deferred_tools_persist.py
  • backend/tests/api/openapi_snapshot.json
  • backend/tests/integration/test_permission_flow.py
  • backend/tests/integration/test_rate_limit_backoff.py
  • frontend/src/__tests__/IssueSidePanel.permission-prompt.test.tsx
  • frontend/src/api/client.ts
  • frontend/src/components/issues/IssueSidePanel.tsx
Latest Contributors(1)
UserCommitDate
galank@gmail.comfix(executor): harden ...June 01, 2026
Other Other files
Modified files (6)
  • backend/cliff/models/__init__.py
  • backend/pyproject.toml
  • backend/tests/agents/tools/__init__.py
  • backend/tests/test_executor.py
  • backend/tests/test_routes_agent_execution.py
  • frontend/src/__tests__/IssueSidePanel.sse-progress.test.tsx
Latest Contributors(1)
UserCommitDate
galank@gmail.comfeat(agents): persiste...June 01, 2026
Tool Primitives Implement the in-process tool primitives, permission-tier gate, MCP adapter, and focused tests so the remediation_executor agent can invoke bash/edit/read/webfetch/gh safely and retain the existing auto/ask/deny policy under Pydantic AI.
Modified files (13)
  • backend/cliff/agents/runtime/remediation_executor.py
  • backend/cliff/agents/runtime/tools/__init__.py
  • backend/cliff/agents/runtime/tools/bash.py
  • backend/cliff/agents/runtime/tools/edit.py
  • backend/cliff/agents/runtime/tools/gh.py
  • backend/cliff/agents/runtime/tools/mcp.py
  • backend/cliff/agents/runtime/tools/permissions.py
  • backend/cliff/agents/runtime/tools/read.py
  • backend/cliff/agents/runtime/tools/webfetch.py
  • backend/tests/agents/test_remediation_executor.py
  • backend/tests/agents/tools/test_mcp.py
  • backend/tests/agents/tools/test_permissions.py
  • backend/tests/agents/tools/test_tools.py
Latest Contributors(1)
UserCommitDate
galank@gmail.comfeat(agents): executor...June 01, 2026
Review this PR on Baz | Customize your next review

galanko and others added 6 commits May 31, 2026 23:10
First slice of PR #2 (ADR-0047 / IMPL-0022) — the five in-process tool
functions that replace OpenCode's tool dispatch for the
remediation_executor, plus the permission tiering they enforce.

New package backend/cliff/agents/runtime/tools/:
- permissions.py — classify_tool_request + _CATASTROPHIC_BASH +
  _GATED_BASH + _is_pipe_to_shell ported VERBATIM from the OpenCode-era
  executor classifier (same patterns, same tiers). gate_tool_call is the
  new thin layer that maps a tier to PA's HITL vocabulary: deny → raise
  ValueError (deterministic, model pivots), ask → raise ApprovalRequired
  until ctx.tool_call_approved, auto → proceed.
- bash.py — async subprocess via asyncio.to_thread (non-blocking),
  cwd=workspace_dir, 180s per-command timeout, output trimmed to last
  200 lines.
- edit.py — classifier gate + a second resolved-path containment check
  against workspace_dir (catches symlink / normalized-separator escapes
  the textual check misses).
- read.py — auto tier, 50 KB cap with truncation marker.
- webfetch.py — auto tier, 30s timeout, text/* + application/json
  allowlist, 50 KB cap. SSRF posture documented (single-tenant CE).
- gh.py — thin wrapper that prepends "gh " and delegates to bash so the
  token stays out of the model-visible command string; GH_TOKEN flows
  via deps.env_vars.

The OpenCode classifier copy in executor.py stays until PR2.E deletes
it; both are identical and both are tested.

Tests (44):
- test_permissions.py — the full classifier table ported verbatim from
  the OpenCode-era TestClassifyToolRequest, plus gate behavior
  (deny→ValueError even when approved, ask→ApprovalRequired,
  ask-approved→proceeds).
- test_tools.py — functional coverage of all five tools against a
  tmp_path workspace (run/trim/deny/ask/approve for bash; relative
  write + absolute/traversal/symlink-escape gating for edit; cap +
  missing-file for read; content-type allowlist for webfetch via
  httpx_mock; token-missing hint + bash delegation for gh).

ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Second slice of PR #2 (ADR-0047 / IMPL-0022): the tool-using
remediation_executor agent definition, plus the MCP-config adapter and
the runtime fixes that registering tools on a real Agent surfaced.

New:
- runtime/remediation_executor.py — build_agent(model, mcp_toolsets=())
  returns Agent[WorkspaceDeps, RemediationExecutorOutput |
  DeferredToolRequests], registering the five tools. SYSTEM_PROMPT is
  lifted from the OpenCode-era Jinja template — workspace-safety block
  and all seven hard rules preserved verbatim (each = a real shipped
  regression); the dynamic finding/plan context now arrives via the
  shared build_user_prompt (as the no-tools agents consume it); the JSON
  output-contract block is gone (output_type enforces the schema). The
  union output_type is what lets a gated tool's ApprovalRequired pause
  the run with a DeferredToolRequests output instead of erroring.
- runtime/tools/mcp.py — build_mcp_toolsets converts Cliff's resolved
  OpenCode-shaped MCP configs ({type, command, environment} / {type,
  url}) into PA MCPServerStdio / MCPServerStreamableHTTP toolsets,
  tool_prefix'd by integration id, malformed entries skipped not fatal.

Two runtime discoveries from wiring tools onto an Agent:

1. PA introspects each tool's type hints via get_type_hints at
   registration. With `from __future__ import annotations` the
   RunContext[WorkspaceDeps] annotation is a string that must resolve
   against module globals — so RunContext + WorkspaceDeps move from
   TYPE_CHECKING to runtime imports in all five tool modules. Scoped
   ruff per-file-ignore (TC001/TC002) for runtime/tools/*, matching the
   existing cliff/models/*.py precedent.

2. deny tier raises ModelRetry, not ValueError. IMPL-0022 named
   ValueError, but PA propagates a raw tool ValueError (crashing the
   run) whereas ModelRetry is converted to a RetryPromptPart the model
   sees and pivots on — which is exactly ADR-0047's stated intent ("the
   model gets a deterministic error and pivots"). Verified in PA's
   tool_manager.py. A model that keeps retrying a denied command hits
   PA's per-tool retry cap; no infinite loop. The verbatim classifier
   (auto/ask/deny classification) is unchanged — only the tier→exception
   translation differs. Flagged for an IMPL-0022 text update.

MCPServerStdio/StreamableHTTP are deprecated in PA 1.98 in favour of
MCPToolset, but MCPToolset's arbitrary-command form needs fastmcp (not a
dependency on the pinned line) — so the deprecated-but-working classes
are the only ones that express our `npx …` configs. Deprecation
suppressed at the construction site with a documented reason; revisit at
the PA v2 bump (tracked with the PR #3 config rename).

Tests (55 new): test_remediation_executor.py drives the agent with
TestModel (normal completion → RemediationExecutorOutput) and
FunctionModel (gated `rm -rf` → DeferredToolRequests with the pending
approval + command metadata; catastrophic `sudo` → ModelRetry → model
pivots, no defer). test_mcp.py covers the adapter (local→stdio,
remote→http, disabled/malformed/unknown skipped). Full agents suite: 98
passed. ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…C/1)

Durable storage for the PA DeferredToolRequests resume cycle, ahead of
the executor dispatch + resume wiring:

- migration 023 — adds nullable agent_run.pa_message_history TEXT. When
  the remediation_executor calls a gated tool the run pauses with a
  DeferredToolRequests output; resuming needs the full conversation fed
  back via agent.run(message_history=...). Storing PA's
  all_messages_json() in the DB (not in-memory / not a workspace file)
  keeps it atomic with the existing permission_request marker and
  survives a daemon restart mid-approval — same durability the marker
  got in migration 022.
- AgentRunUpdate.pa_message_history field so update_agent_run writes it;
  deliberately NOT on the API-facing AgentRun model (large internal blob).
- repo_agent_run.get_pa_message_history(run_id) — targeted reader for the
  resume path.
- reconcile_orphaned_agent_runs also NULLs pa_message_history so a
  restart mid-pause can't leave stale resume state (mirrors how it
  already clears permission_pending/permission_request).

Verified: migration applies + column present; 84 model/db/migration
tests pass; ruff clean. Dispatch + _run_pa_executor + resume API land in
the next commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
….C/2)

Routes the remediation_executor off OpenCode onto the in-process Pydantic
AI agent + the durable DeferredToolRequests park/resume cycle.

Executor (executor.py):
- _finalize_run: extracted from execute()'s post-parse+persist tail
  (ref-verify, evidence-guard, PR-verify, context/sidebar/advance,
  run-update) so the initial run AND the resume path reach identical
  finalization — one source of truth. Pure refactor, verified no-op
  against the existing suite before the new code landed.
- _resolve_active_model + _run_pa_call: shared helpers (resolver gather →
  build_model; and the rate-limit/timeout/UnexpectedModelBehavior
  translation loop) used by both the no-tools path and the executor —
  removes ~145 lines of duplication.
- _run_pa_executor: builds the executor agent (build_agent + resolved MCP
  toolsets), runs it via _run_pa_call. A DeferredToolRequests output →
  _PaExecutorOutcome carrying the permission marker + all_messages_json();
  a completed output → a ParseResult shaped like the no-tools path.
- _park_for_permission: persists the marker (permission_pending=1 +
  permission_request) + pa_message_history, returns the new
  ``awaiting_permission`` AgentExecutionResult status (run row stays
  ``running`` so issue_derivation routes to "Needs you").
- resume_executor: reloads deps, deserializes the message history,
  re-enters agent.run with DeferredToolResults (ToolApproved/ToolDenied
  per pending tool_call_id), then _finalize_run (or re-park).
- execute() dispatch: remediation_executor → _run_pa_executor; the old
  OpenCode tool-agent block is gone (its orphaned plumbing —
  _send_and_collect, the permission queue — is deleted in PR2.E).

pipeline.py: run_pipeline also stops on ``awaiting_permission`` (a
non-terminal pause, not a failure).

route (agent_execution.py): POST .../permission now resumes via
executor.resume_executor in a background task and returns immediately;
the frontend polls agent-runs for the outcome (no SSE — PR2.D deletes it).

tools: bash/edit gate metadata now carries ``patterns`` so the persisted
marker matches what the frontend approval prompt renders (tool + patterns).

Tests:
- test_deferred_tools_persist.py (new): real FunctionModel park→resume
  cycle — gated bash pauses with a marker + history, resume(ToolApproved)
  completes, marker cleared; resume-without-marker raises.
- test_executor.py: keeper tests retargeted to the PA path via a shared
  _executor_with_pa helper; OpenCode-machinery tests (pool-start, SSE
  error/progress, parse-retry, asyncio.Event approve) skipped with
  PR2.E reasons (same staged pattern PR #1 used).
- test_rate_limit_backoff.py: rewritten to exercise the shared
  _run_pa_call retry via a flaky run_no_tools_agent raising
  ModelHTTPError(429). The OpenCode string-classification corpus tests
  are retired — PA detects rate-limit structurally (status_code==429),
  not by matching error-message text. Jitter unit test preserved.
- test_permission_flow.py: 2 OpenCode approve/deny flow tests skipped
  (covered by test_deferred_tools_persist.py); the derivation-routing
  test stays.
- test_routes_agent_execution.py: TestPermissionEndpoint retargeted to
  the background resume_executor flow.

Full slice-C suite: 311 passed, 10 skipped, 0 failed; ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ery (PR2.D+E)

With the executor on Pydantic AI + a durable DeferredToolRequests marker
(PR2.C), the whole push-based progress/approval channel is dead weight —
the frontend already polls agent-runs (3s active / 5s idle) and renders
the permission prompt + activity feed from that. ADR-0047's "much simpler
mechanism": executor writes the marker → frontend polls → renders.

PR2.D — delete the frontend SSE:
- backend: removed GET /workspaces/{id}/agent-execution/stream + its
  event_generator; dropped the on_permission=push_permission_event wiring
  from execute_agent + run_all.
- frontend: removed the EventSource useEffect from IssueSidePanel (and the
  now-unused useQueryClient/api imports) and api.streamAgentExecution.
  The permission prompt + feed render entirely from the polled agent-runs
  query — the SSE was only ever a refetch nudge, carried no data, and
  streamed no tokens.
- tests: deleted IssueSidePanel.sse-progress.test.tsx + the EventSource
  stub/assertion in the permission-prompt test; removed the backend SSE
  stream route tests.

PR2.E — delete the orphaned OpenCode permission/streaming machinery in
executor.py:
- _PendingApproval, approve_tool, deny_tool, push_permission_event,
  get_permission_queue, ensure_permission_queue, _patch_permission_marker
- _send_and_collect + _collect_response (the OpenCode SSE consumer)
- _RETRY_PROMPT + the dead _RATE_LIMIT_SUBSTRINGS / _looks_like_rate_limit
  string-classifier (PA detects rate-limit structurally via
  ModelHTTPError.status_code == 429)
- _cleanup_workspace_state trimmed to just clearing the active-run marker;
  the _pending_approvals / _permission_queues / _permission_pending dicts
  are gone. (_active_runs / get_active_run_id / check_not_busy stay — those
  are run-tracking, not SSE.)
- tests: deleted TestPermissionApproval; skipped the 3 SSE-progress-event
  tests in TestAgentExecutor (PR2.D removed the events).

Note: cliff/workspace/repo_workspace_runner.py keeps its OWN
_send_and_collect (the repo-action generators are a separate OpenCode path
that PR #3 migrates). _STRUCTURED_OUTPUT_CONTRACTS + build_agent_prompt also
stay for PR #3's plumbing/template deletion.

openapi snapshot regenerated: drops the stream path, adds the optional
deny_message field on PermissionDecision (the resume API).

Verified: ruff clean; backend non-e2e suite 1570 passed / 14 skipped
(only failure is the pre-existing wiz-s3-public LLM-eval flake, untouched
by this branch); the SSE-touched frontend tests (permission-prompt +
IssueSidePanel core) pass. The 10 failing frontend chip-styling tests are
pre-existing design-token drift in components this branch never touches.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tate

/code-review (recall) confirmed: the resume setup steps inside the try —
ModelMessagesTypeAdapter.validate_json on a corrupt/truncated
pa_message_history, or a malformed tool_call_ids — raise
ValidationError/TypeError, which the specific
except (AgentTimeoutError|AgentRateLimitError|AgentProcessError) handler
didn't cover. The exception escaped to the route's catch-all logger and
the run stayed running + permission_pending forever (workspace busy
until daemon restart).

Two fixes:
- tool_call_ids = (... or {}).get("tool_call_ids") or []  — an explicit
  null in the marker JSON can no longer slip a None into the approvals
  comprehension.
- Added a catch-all except in resume_executor that marks the run failed
  (cleanup + status=failed + last_error), so NO resume error can leave
  the run wedged — the finding routes to the Retry CTA instead of hanging.

Test: test_resume_with_corrupt_history_marks_failed_not_wedged — a
non-JSON pa_message_history yields status=failed + marker cleared, not a
raised exception or a stuck run.

Review notes (no change needed):
- background-task DB use is safe — get_db yields a process-global
  singleton with no per-request close (same pattern execute_agent uses).
- edit approved-escape is by-design — PA sets ctx.tool_call_approved
  per-tool-call-id, so an approved escaping edit is the user consenting
  to that exact path (HITL).
- Dead OpenCode constants (PERMISSION_WAIT_BUFFER, TOOL_TIERS,
  _STRUCTURED_OUTPUT_CONTRACTS, build_agent_prompt) + stale frontend
  OpenAPI types (deleted stream endpoint, missing deny_message) are
  PR #3 / type-regen cleanup, deliberately deferred.

80 passed / 11 skipped on the executor+resume+route slice; ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR replaces the OpenCode subprocess/SSE remediation executor with an in-process Pydantic AI remediation agent, adds gated tool approval with durable pause/resume (pa_message_history), centralizes PA call/retry/error handling, removes SSE streaming in favor of polling, and updates DB, API, frontend, and tests.

Changes

PA executor migration and permission deferral

Layer / File(s) Summary
Pipeline stop condition
backend/cliff/agents/pipeline.py
run_pipeline treats awaiting_permission as a stopping/pausing status.
Remediation executor runtime
backend/cliff/agents/runtime/remediation_executor.py
Adds SYSTEM_PROMPT and build_agent(...) producing union output supporting DeferredToolRequests for gated tool deferral.
Tools package init
backend/cliff/agents/runtime/tools/__init__.py
Registers ordered EXECUTOR_TOOLS tuple exported for the remediation agent.
Bash tool
backend/cliff/agents/runtime/tools/bash.py
Async bash tool with gating, background subprocess execution, timeout, output trimming, and export.
GH wrapper tool
backend/cliff/agents/runtime/tools/gh.py
gh validates GH_TOKEN then delegates to bash.
DB migration and repo helpers
backend/cliff/db/migrations/023_agent_run_pa_message_history.sql, backend/cliff/db/repo_agent_run.py
Add pa_message_history column and repo helper get_pa_message_history; reconciliation clears the field when failing orphaned runs.
Model updates / OpenAPI
backend/cliff/models/__init__.py, backend/tests/api/openapi_snapshot.json
Add pa_message_history to AgentRunUpdate and deny_message to PermissionDecision; snapshot updated and streaming path removed.
Remediation executor tests
backend/tests/agents/test_remediation_executor.py
Tests for system prompt safety, tool registration, normal completion, deferral via DeferredToolRequests, and deny-tier handling.
Permission classifier & gating tests
backend/tests/agents/tools/test_permissions.py
Tests for classify_tool_request and gate_tool_call including ModelRetry and ApprovalRequired semantics.
API permission endpoint refactor
backend/cliff/api/routes/agent_execution.py
respond_to_permission resumed paused executor asynchronously with approved/deny_message; removed on_permission callbacks at background launch sites.
AgentExecutor core changes
backend/cliff/agents/executor.py
Add Pydantic AI message/tool primitives, remove OpenCode-era helpers, extend AgentExecutionResult.status with awaiting_permission, add _run_pa_call/_resolve_active_model with timeout/retry/exception mapping, initialize _active_runs.
PA executor, park/resume, finalization
backend/cliff/agents/executor.py
Implement _run_pa_executor, _park_for_permission, resume_executor (rehydrate message history, apply approvals/denials), and centralized _finalize_run.
Edit tool
backend/cliff/agents/runtime/tools/edit.py
edit tool with gating, resolved-path containment check, ApprovalRequired on escapes, background write, and relative-path display.
MCP toolset adapter
backend/cliff/agents/runtime/tools/mcp.py
Adapter building MCP toolsets from config dicts, skipping malformed/disabled entries with warnings.
Permissions classifier implementation
backend/cliff/agents/runtime/tools/permissions.py
Deny/ask/auto tiers, catastrophic denylist and gated patterns, _is_pipe_to_shell, classify_tool_request, and gate_tool_call raising ModelRetry or ApprovalRequired.
Read tool
backend/cliff/agents/runtime/tools/read.py
Workspace-scoped read tool with 50KB cap, escape refusal, threaded read, decode-with-replacement, and truncation marker.
Webfetch tool
backend/cliff/agents/runtime/tools/webfetch.py
HTTP GET wrapper with httpx streaming, content-type allowlist, 50KB cap, truncation marker, and formatted response.
Frontend SSE removal
frontend/src/api/client.ts, frontend/src/components/issues/IssueSidePanel.tsx, frontend/src/__tests__/IssueSidePanel.permission-prompt.test.tsx
Removed agent-execution SSE client method and in-component EventSource; added respondToPermission/respondToChatPermission client endpoints and updated tests to rely on polled agent-runs.
Deferred tools park/resume tests
backend/tests/agents/test_deferred_tools_persist.py
End-to-end tests asserting run parks with permission marker + persisted PA message history, and resume behavior for approve/deny, plus corrupt-history and marker-shaping tests.
MCP tests
backend/tests/agents/tools/test_mcp.py
Tests for local/remote MCP toolset construction, skipping disabled/malformed/unknown-type entries.
Tool functional tests
backend/tests/agents/tools/test_tools.py
Functional tests for bash/edit/read/webfetch/gh covering execution, safety gating, escapes, truncation, and delegation.
Executor unit test updates
backend/tests/test_executor.py
Refactor tests to stub PA execution (_executor_with_pa/_pa_success_outcome/_pr_created_output), skip OpenCode/SSE-specific tests, and rewire PR-verification/timeouts to PA seam.
Rate-limit backoff tests
backend/tests/integration/test_rate_limit_backoff.py
Rewrote to use ModelHTTPError-driven 429 detection and flaky-run helpers, plus zeroed-backoff helper and sustained-throttling scenario.
Permission route tests
backend/tests/test_routes_agent_execution.py
Updated tests to assert /permission schedules resume_executor for pending runs and returns 404 for non-pending or cross-workspace runs.
Integration skips and OpenAPI
backend/tests/integration/test_permission_flow.py, backend/tests/api/openapi_snapshot.json
Legacy in-process approval tests skipped; OpenAPI snapshot updated with new fields and streaming path removed.
Lint config
backend/pyproject.toml
Add per-file Ruff ignore for TC001/TC002 in runtime tools path.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • cliff-security/cliff#249: Touches Pydantic-AI execution retry/error handling and run_pa* logic, which relates to this PR’s centralized PA call handling.
  • cliff-security/cliff#213: Frontend permission prompt UI changes that complement this PR’s paused/permission flow semantics.

Poem

"🐰 A rabbit in the toolchain hops,
Pauses runs to check for props,
Saves the chat to pick up later,
Approves or denies with human gator,
SSE fades — polling keeps the crops."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.89% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main change: migrating the remediation_executor from OpenCode subprocess to Pydantic AI, indicating this is PR #2 of a larger series.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/pa-migrate-executor

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
frontend/src/__tests__/IssueSidePanel.permission-prompt.test.tsx (1)

63-86: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Exercise the paused-run contract in the fixture.

These tests still model the gated executor row as status: 'running', but the new backend flow parks it as awaiting_permission. Leaving the default fixture on the old status means this suite won’t catch the footer bug above once the real API starts returning parked runs.

Suggested fix
 function makeRun(overrides: Partial<AgentRun> = {}): AgentRun {
   return {
     id: 'run-1',
     workspace_id: 'ws-1',
     agent_type: 'remediation_executor',
-    status: 'running',
+    status: 'awaiting_permission',
     input_json: null,
     summary_markdown: null,
     confidence: null,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/__tests__/IssueSidePanel.permission-prompt.test.tsx` around
lines 63 - 86, The fixture makeRun currently returns a run with status 'running'
which prevents tests from exercising the paused-run contract; change the default
status in makeRun to 'awaiting_permission' so the
permission_pending/permission_request path is exercised by default (keep
permission_pending: true and permission_request as-is and allow callers to
override via the overrides parameter so existing tests can still set other
statuses when needed).
frontend/src/components/issues/IssueSidePanel.tsx (1)

1162-1203: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Match the parked run when rendering the permission footer.

This branch only looks up status === 'running', but the new park/resume flow persists the paused executor as awaiting_permission. In that case runningRun stays null, the footer remains stuck on “Waiting for approval details”, and the user can never approve or deny the deferred tool call.

Suggested fix
-  const runningRun =
-    agentRuns?.find((r) => r.status === 'running') ?? null
+  const actionableRun =
+    agentRuns?.find(
+      (r) => r.status === 'running' || r.status === 'awaiting_permission',
+    ) ?? null

   if (stage === 'awaiting_permission') {
-    const req = runningRun?.permission_request ?? null
-    if (!req || !runningRun) {
+    const req = actionableRun?.permission_request ?? null
+    if (!req || !actionableRun) {
       return (
         <div className="flex items-center gap-3 w-full">
           <CliffSpinner size={14} label="Loading approval details" />
@@
         errorMessage={
           respondToPermission.isError
             ? friendlyPermissionError(respondToPermission.error)
             : null
         }
         onApprove={() =>
-          respondToPermission.mutate({ runId: runningRun.id, approved: true })
+          respondToPermission.mutate({ runId: actionableRun.id, approved: true })
         }
         onDeny={() =>
-          respondToPermission.mutate({ runId: runningRun.id, approved: false })
+          respondToPermission.mutate({ runId: actionableRun.id, approved: false })
         }
       />
     )
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/components/issues/IssueSidePanel.tsx` around lines 1162 - 1203,
The permission-footer branch only looks up agentRuns.find(...) for status ===
'running', so when a run is parked as 'awaiting_permission' the UI never binds
the parked run and stays in the loading state; change the lookup used by
IssueSidePanel (the runningRun constant) to include both statuses (e.g.,
agentRuns?.find(r => r.status === 'running' || r.status ===
'awaiting_permission') ?? null) so the PermissionPrompt receives the parked run
(its permission_request and id) and the respondToPermission.mutate calls will
target the correct run.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/cliff/agents/executor.py`:
- Around line 833-834: The code calls _finalize_run assuming
outcome.parse_result exists even though _PaExecutorOutcome.parse_result can be
None; add an explicit guard/invariant before calling _finalize_run: check that
outcome.parse_result is not None (raise an AssertionError or ValueError with a
clear message if it is) and then pass the non-None parse_result to
_finalize_run; apply the same explicit check in the other occurrence referenced
(around the 1617–1622 block) to keep typing strict and avoid hidden None runtime
paths, and update any local type narrowing so callers treat parse_result as
non-Optional after the check.

In `@backend/cliff/agents/runtime/tools/bash.py`:
- Around line 62-81: The current _run() uses subprocess.run(...) and
asyncio.to_thread which buffers all output and may leave child processes alive
on timeout; replace this with an asyncio-based spawn using
asyncio.create_subprocess_shell/create_subprocess_exec (or subprocess.Popen with
start_new_session=True) so the process is started in a new session/process
group, stream stdout/stderr into bounded deques (or byte/line ring buffers) to
enforce output limits instead of capture_output=True, and enforce the timeout
with asyncio.wait_for; on timeout kill the whole process group
(os.killpg(process.pid, signal.SIGTERM) then SIGKILL if needed) and return the
trimmed buffered output. Update references in this file: replace uses of _run,
subprocess.run, and asyncio.to_thread with the new async streaming
implementation, keep using ctx.deps.workspace_dir, ctx.deps.env_vars, and
_BASH_TIMEOUT_SECONDS, and ensure _trim_output is replaced/rewired to use the
bounded buffers you populate incrementally.

In `@backend/cliff/agents/runtime/tools/edit.py`:
- Around line 62-70: The post-write call to target.relative_to(...) can raise
ValueError if the approved target is outside the workspace, turning a successful
write into a failure; wrap the relative_to() call in a try/except (catch
ValueError) and on exception produce a safe fallback path string (for example
use str(target) or a posix-style "../..." path) so the function still returns
success; modify the block around rel =
target.relative_to(Path(ctx.deps.workspace_dir).resolve()) in edit.py
(referencing target and the _write function) to attempt relative_to and fall
back to a non-crashing representation when ValueError is raised.

In `@backend/cliff/agents/runtime/tools/mcp.py`:
- Around line 85-103: Validate the MCP config types before constructing
MCPServerStdio or MCPServerStreamableHTTP: ensure `command` is a non-empty
list/tuple of strings (not a single string) before using command[0] and args,
and ensure `url` is a string before constructing the remote server; if the
values are malformed, raise a ValueError with a clear message (including
`entry_id`) so callers can skip the malformed toolset instead of invoking an
invalid tool.
- Around line 42-49: The helper context manager _suppress_mcp_v2_deprecation
lacks an explicit return type; annotate its signature as -> Iterator[None] and
add the necessary import (from typing import Iterator) so the function becomes
typed as an iterator-based contextmanager for backend strict typing (update the
_suppress_mcp_v2_deprecation definition and imports accordingly).

In `@backend/cliff/agents/runtime/tools/read.py`:
- Around line 28-45: The read tool currently resolves target =
(Path(ctx.deps.workspace_dir) / path).resolve() and opens it without approval or
containment checks; update the code to (1) call the centralized approval/gating
function (e.g., ctx.deps.gate_tool_call or equivalent) before reading, and (2)
enforce that the resolved target is inside the workspace by comparing
Path(ctx.deps.workspace_dir).resolve() and using
target.relative_to(workspace_dir) (or an explicit check that
target.resolve().is_relative_to(workspace_dir.resolve()) / raises if not) and
return a safe error if the check fails; keep the existing _read function and
asyncio.to_thread usage and preserve _MAX_READ_BYTES handling but never open
files outside the workspace or via symlink traversal without explicit approval.

In `@backend/cliff/agents/runtime/tools/webfetch.py`:
- Around line 31-52: The current implementation uses await client.get(url) and
resp.text which buffers the entire response before checking content-type and
enforcing _MAX_BODY_BYTES; change to a streaming fetch using
httpx.AsyncClient.stream (e.g., async with client.stream("GET", url) as resp) so
you can check resp.headers.get("content-type") against _ALLOWED_CONTENT_TYPES
before reading the body, then read chunks (resp.aiter_bytes or resp.aiter_raw)
accumulating at most _MAX_BODY_BYTES + 1 bytes and stop once that limit is
reached to detect truncation, decode the collected bytes with errors="replace",
append the truncation notice if the extra byte was read, and return the same
"HTTP {resp.status_code}\n{body}" string; preserve the existing HTTPError
handling around the request and continue using _WEBFETCH_TIMEOUT_SECONDS for the
client timeout.

In `@backend/cliff/api/routes/agent_execution.py`:
- Around line 532-540: The handler respond_to_permission currently checks that
the run exists and is permission_pending but does not verify ownership; after
calling get_agent_run and get_workspace, add an explicit check that
run.workspace_id == workspace_id (reference: get_agent_run, run.workspace_id,
get_workspace, respond_to_permission) and if they do not match raise an
HTTPException (404 or 403 per project convention) with an appropriate message
like "Agent run not found in this workspace" before proceeding with the
permission decision.
- Line 518: The new endpoint's db parameter is untyped; annotate it as a
SQLAlchemy Session to keep strict typing consistent by changing the parameter
from db=Depends(get_db) to db: Session = Depends(get_db) in the endpoint
signature (reference the db parameter and the get_db dependency), and add the
corresponding import (from sqlalchemy.orm import Session) at the top of the
module.

In `@backend/cliff/models/__init__.py`:
- Around line 173-176: The AgentRunStatus union/enum is missing the
"awaiting_permission" state which causes Pydantic validation to fail for
AgentRunUpdate paths that set status to "awaiting_permission"; update the
AgentRunStatus definition in backend/cliff/models/__init__.py (the
AgentRunStatus type/Enum used by AgentRun and AgentRunUpdate) to include the
literal/value "awaiting_permission", and run through any related validators or
type annotations (e.g., usages of AgentRunStatus, AgentRunUpdate, or status
fields) to ensure they accept this new state so pause/resume persistence and
park/resume flows validate correctly.

In `@backend/tests/agents/test_deferred_tools_persist.py`:
- Around line 39-45: The new fixture workspace_dir and other newly added helper
functions in this test module lack type hints; add strict annotations for
parameters and return types (e.g., annotate the pytest tmp_path parameter as
pathlib.Path and the fixture return as str for workspace_dir), and similarly
annotate all functions referenced in the review (the defs around lines 51-57 and
77-113) with concrete types (use pathlib.Path, dict[str, Any], list[T], bool, or
specific models as appropriate) so the module complies with the backend strict
typing guideline.

In `@backend/tests/agents/test_remediation_executor.py`:
- Around line 60-66: The local callback functions like _call_gated_bash need
explicit type annotations to satisfy strict typing: add parameter and return
type hints (e.g., def _call_gated_bash(messages: list[Message], info:
FunctionCallbackInfo) -> ModelResponse:) matching your FunctionModel callback
signature and any other callbacks in the same file (lines ~87-103), import or
reference the correct Message and FunctionCallbackInfo/ModelResponse types from
the codebase, and ensure any dicts/args use typing (dict[str, Any]) where
appropriate so the functions are fully typed.

In `@backend/tests/agents/tools/test_tools.py`:
- Around line 113-115: The test test_approved_write_outside_proceeds is passing
a workspace-relative path ("src/ok.py") so it never hits the "outside" edit
branch; change the test to pass an actual filesystem path outside the workspace
(e.g. construct an absolute path using tmp_path / "ok_outside.py" and ensure the
file exists) when calling edit(_ctx(tmp_path, approved=True), <outside_path>,
"y\n") so the outside/ask-tier path is exercised, then keep the existing assert
"Wrote" in result; adjust only the path argument to edit and any necessary test
file setup.

In `@backend/tests/integration/test_rate_limit_backoff.py`:
- Around line 91-93: The helper factory _make_flaky_run should declare full type
hints: annotate parameters (error_attempts: int, status_code: int = 429,
always_error: bool = False) and specify a precise return type (e.g.,
Callable[..., Coroutine[Any, Any, ReturnType]] or Callable[[...],
Awaitable[ResponseType]] depending on what the helper returns); update the inner
helper callables referenced around lines 102-103 and 113-116 similarly so every
helper parameter and return is explicitly typed (use typing.Callable,
typing.Awaitable/Coroutine, and concrete response/result types from the test
helpers or HTTP client types) to satisfy strict Python 3.11 backend typing
rules.

In `@backend/tests/test_executor.py`:
- Around line 111-116: Add strict type hints to the PA test helper functions:
annotate _executor_with_pa parameters and return type — e.g., declare mock_pool:
unittest.mock.Mock (or the specific Pool type), mock_context_builder:
Callable[..., Any], outcome_factory: Callable[[], _PaExecutorOutcome] and give
the function a concrete return type (the Executor/test helper type it produces);
similarly add matching type annotations to the other helper(s) referenced around
lines 135-150. Ensure you import typing symbols (Callable, Any) and the
_PaExecutorOutcome and Executor types so the signatures are fully typed.

In `@backend/tests/test_routes_agent_execution.py`:
- Around line 767-769: Replace the timing-dependent asyncio.sleep usage with an
explicit synchronization primitive: create an asyncio.Event in the test and
update the test helper that appends to the calls list (the same place that
currently appends ("run-1", True)) to call event.set() when it records the call;
then await asyncio.wait_for(event.wait(), timeout=...) instead of
asyncio.sleep(0.02). Apply the same change for the other occurrence that waits
for the background resume (lines referencing calls and the ("run-1", True)
assertion) so the test deterministically waits for the background resume to
complete.
- Around line 709-714: Annotate the new helper and the async test signatures
with explicit type hints: change def _permission_route_patches(*, pending: bool)
to include a return type (for example def _permission_route_patches(*, pending:
bool) -> Iterator[None] and import Iterator from typing, or use
ContextManager[Any] if it yields a context manager), and ensure every async test
function in this file has an explicit return annotation (e.g. async def
test_xyz(...) -> None). Also add missing parameter type annotations for any
other params referenced in those helpers (e.g. any mocked objects) so all new
helper and test signatures are fully typed per backend strict-typing rules.

---

Outside diff comments:
In `@frontend/src/__tests__/IssueSidePanel.permission-prompt.test.tsx`:
- Around line 63-86: The fixture makeRun currently returns a run with status
'running' which prevents tests from exercising the paused-run contract; change
the default status in makeRun to 'awaiting_permission' so the
permission_pending/permission_request path is exercised by default (keep
permission_pending: true and permission_request as-is and allow callers to
override via the overrides parameter so existing tests can still set other
statuses when needed).

In `@frontend/src/components/issues/IssueSidePanel.tsx`:
- Around line 1162-1203: The permission-footer branch only looks up
agentRuns.find(...) for status === 'running', so when a run is parked as
'awaiting_permission' the UI never binds the parked run and stays in the loading
state; change the lookup used by IssueSidePanel (the runningRun constant) to
include both statuses (e.g., agentRuns?.find(r => r.status === 'running' ||
r.status === 'awaiting_permission') ?? null) so the PermissionPrompt receives
the parked run (its permission_request and id) and the
respondToPermission.mutate calls will target the correct run.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 93b1665d-8d71-4054-b0aa-65255aef8408

📥 Commits

Reviewing files that changed from the base of the PR and between cef63fb and c25f096.

📒 Files selected for processing (31)
  • backend/cliff/agents/executor.py
  • backend/cliff/agents/pipeline.py
  • backend/cliff/agents/runtime/remediation_executor.py
  • backend/cliff/agents/runtime/tools/__init__.py
  • backend/cliff/agents/runtime/tools/bash.py
  • backend/cliff/agents/runtime/tools/edit.py
  • backend/cliff/agents/runtime/tools/gh.py
  • backend/cliff/agents/runtime/tools/mcp.py
  • backend/cliff/agents/runtime/tools/permissions.py
  • backend/cliff/agents/runtime/tools/read.py
  • backend/cliff/agents/runtime/tools/webfetch.py
  • backend/cliff/api/routes/agent_execution.py
  • backend/cliff/db/migrations/023_agent_run_pa_message_history.sql
  • backend/cliff/db/repo_agent_run.py
  • backend/cliff/models/__init__.py
  • backend/pyproject.toml
  • backend/tests/agents/test_deferred_tools_persist.py
  • backend/tests/agents/test_remediation_executor.py
  • backend/tests/agents/tools/__init__.py
  • backend/tests/agents/tools/test_mcp.py
  • backend/tests/agents/tools/test_permissions.py
  • backend/tests/agents/tools/test_tools.py
  • backend/tests/api/openapi_snapshot.json
  • backend/tests/integration/test_permission_flow.py
  • backend/tests/integration/test_rate_limit_backoff.py
  • backend/tests/test_executor.py
  • backend/tests/test_routes_agent_execution.py
  • frontend/src/__tests__/IssueSidePanel.permission-prompt.test.tsx
  • frontend/src/__tests__/IssueSidePanel.sse-progress.test.tsx
  • frontend/src/api/client.ts
  • frontend/src/components/issues/IssueSidePanel.tsx
💤 Files with no reviewable changes (2)
  • frontend/src/tests/IssueSidePanel.sse-progress.test.tsx
  • frontend/src/api/client.ts

Comment on lines +833 to 834
parse_result = outcome.parse_result
else:
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard outcome.parse_result before calling _finalize_run.

Both paths assume parse_result exists when permission_request is absent, but _PaExecutorOutcome allows parse_result=None. Add an explicit invariant check to prevent a hidden None runtime path and keep typing strict.

Proposed fix
                 if outcome.permission_request is not None:
                     return await self._park_for_permission(
                         db, workspace_id, agent_run, outcome, start_time
                     )
-                parse_result = outcome.parse_result
+                if outcome.parse_result is None:
+                    raise AgentProcessError(
+                        "Executor outcome missing parse_result for completed run."
+                    )
+                parse_result = outcome.parse_result
...
             if outcome.permission_request is not None:
                 return await self._park_for_permission(
                     db, workspace_id, agent_run, outcome, start_time
                 )
+            if outcome.parse_result is None:
+                raise AgentProcessError(
+                    "Executor outcome missing parse_result for completed resume."
+                )
             return await self._finalize_run(
                 db,
                 workspace_id=workspace_id,
                 agent_type=agent_run.agent_type,
                 parse_result=outcome.parse_result,

As per coding guidelines, backend/**/*.py: Use Python 3.11+ with strict type hints and Pydantic models for backend code.

Also applies to: 1617-1622

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/cliff/agents/executor.py` around lines 833 - 834, The code calls
_finalize_run assuming outcome.parse_result exists even though
_PaExecutorOutcome.parse_result can be None; add an explicit guard/invariant
before calling _finalize_run: check that outcome.parse_result is not None (raise
an AssertionError or ValueError with a clear message if it is) and then pass the
non-None parse_result to _finalize_run; apply the same explicit check in the
other occurrence referenced (around the 1617–1622 block) to keep typing strict
and avoid hidden None runtime paths, and update any local type narrowing so
callers treat parse_result as non-Optional after the check.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 9706d20 addressed this comment by adding an explicit guard that raises an AgentProcessError whenever the main tool-agent path returns a None parse_result, ensuring _finalize_run always receives a valid result after permission_request is absent.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipped: comment is from another GitHub bot.

Comment on lines +62 to +81
def _run() -> subprocess.CompletedProcess[str]:
return subprocess.run(
command,
shell=True, # noqa: S602 — the remediation agent legitimately
# needs a shell (pipes, &&, redirects); the denylist gate above
# is the defense-in-depth layer, not shell=False.
cwd=ctx.deps.workspace_dir,
capture_output=True,
text=True,
timeout=_BASH_TIMEOUT_SECONDS,
env=ctx.deps.env_vars or None,
)

try:
proc = await asyncio.to_thread(_run)
except subprocess.TimeoutExpired:
return (
f"Command timed out after {_BASH_TIMEOUT_SECONDS:.0f}s and was "
"killed. If this was a long build, split it into smaller steps."
)
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="backend/cliff/agents/runtime/tools/bash.py"

# Show the relevant portion with line numbers
nl -ba "$FILE" | sed -n '1,220p'

# Focus search for helpers/constants used in/around the snippet
rg -n "_trim_output|_BASH_TIMEOUT_SECONDS|trim|denylist|subprocess\.run|asyncio\.to_thread" "$FILE"

# Also locate any other references to this tool in the repo
rg -n "tools/bash\.py|def bash\(|async def bash\(" backend | head -n 50

Repository: cliff-security/cliff

Length of output: 107


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="backend/cliff/agents/runtime/tools/bash.py"

python3 - <<'PY'
import itertools
path = "backend/cliff/agents/runtime/tools/bash.py"
with open(path, "r", encoding="utf-8") as f:
    lines = f.readlines()

start, end = 40, 120
for i in range(start-1, min(end, len(lines))):
    print(f"{i+1:4d}: {lines[i].rstrip()}")
PY

rg -n "_trim_output|_BASH_TIMEOUT_SECONDS|trim|denylist|subprocess\.run|asyncio\.to_thread" "$FILE"

# show full definitions around the match if any
rg -n "def _run|async def bash|asyncio\.to_thread|shell=True" "$FILE" -n

Repository: cliff-security/cliff

Length of output: 3182


🌐 Web query:

Python subprocess.run shell=True timeout TimeoutExpired child processes continue after timeout capture_output buffers full stdout stderr in memory

💡 Result:

When using subprocess.run with shell=True, timeout, and capture_output=True, you may encounter several critical behaviors: 1. Orphaned Child Processes: If the command executed via shell=True spawns its own child processes, terminating the shell process (the direct child of Python) upon a TimeoutExpired exception does not automatically terminate those "grandchild" processes [1][2]. These descendants are orphaned and inherited by the init process (PID 1), continuing to run in the background [3][2]. 2. Memory Buffering and Deadlocks: When capture_output=True is used, subprocess.run internally uses subprocess.PIPE for stdout and stderr [4][5]. These pipes are managed via OS-level buffers [6]. If a subprocess generates enough output to fill these buffers (typically 64 KB on many systems) and the parent process stops reading from them—or is delayed during the timeout cleanup phase—the child process will block indefinitely, waiting for the buffer to clear [4][6]. This can result in the timeout mechanism failing to raise or clean up as expected [1][4]. 3. Timeout Limitations: The timeout argument in subprocess.run is passed internally to Popen.communicate [7][8][5]. While it attempts to kill the direct child process upon expiration, the complexity of shell=True (where the shell process may not propagate signals correctly to its own children) and the potential for pipe-related blocking can lead to scenarios where the Python call appears to hang or fails to fully reclaim resources [1][2]. To mitigate these issues: - Avoid shell=True whenever possible to maintain a direct process tree and better signal control [7][8]. - Use start_new_session=True in Popen to create a new process group, allowing you to kill the entire group (e.g., using os.killpg) upon a timeout [3]. - If you must use shell=True and require robust cleanup, consider using the Popen constructor directly to manually manage the communication and signal termination of the process group [3][2]. - For large volumes of output, avoid capturing all output in memory via PIPE. Instead, stream the output to files or handle it in a separate thread to prevent pipe buffer deadlocks [4][9][10].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail
python3 - <<'PY'
path = "backend/cliff/agents/runtime/tools/bash.py"
with open(path, "r", encoding="utf-8") as f:
    lines = f.readlines()
for i in range(1, 55):
    if i <= len(lines):
        print(f"{i:4d}: {lines[i-1].rstrip()}")
PY

Repository: cliff-security/cliff

Length of output: 2459


Use a managed process group + bounded streaming for timeouts and output limiting

  • In backend/cliff/agents/runtime/tools/bash.py, _run() uses subprocess.run(..., shell=True, timeout=...); on TimeoutExpired the code reports the command was “killed”, but subprocess.run primarily terminates the shell process—child/grandchild processes can still survive unless you kill the whole process group/session.
  • _trim_output() only trims after subprocess.run(..., capture_output=True) has fully buffered all stdout/stderr in memory, so a noisy command can still consume unbounded memory even though you later keep only the last 200 lines.

Implement via Popen/asyncio subprocess with start_new_session/setsid, stream output into a bounded buffer, and on timeout kill the entire process group.

🧰 Tools
🪛 OpenGrep (1.22.0)

[ERROR] 63-73: Dynamic command passed to subprocess with shell=True. Use a command list without shell=True, or use shlex.quote() to sanitize input.

(coderabbit.command-injection.python-shell-true)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/cliff/agents/runtime/tools/bash.py` around lines 62 - 81, The current
_run() uses subprocess.run(...) and asyncio.to_thread which buffers all output
and may leave child processes alive on timeout; replace this with an
asyncio-based spawn using asyncio.create_subprocess_shell/create_subprocess_exec
(or subprocess.Popen with start_new_session=True) so the process is started in a
new session/process group, stream stdout/stderr into bounded deques (or
byte/line ring buffers) to enforce output limits instead of capture_output=True,
and enforce the timeout with asyncio.wait_for; on timeout kill the whole process
group (os.killpg(process.pid, signal.SIGTERM) then SIGKILL if needed) and return
the trimmed buffered output. Update references in this file: replace uses of
_run, subprocess.run, and asyncio.to_thread with the new async streaming
implementation, keep using ctx.deps.workspace_dir, ctx.deps.env_vars, and
_BASH_TIMEOUT_SECONDS, and ensure _trim_output is replaced/rewired to use the
bounded buffers you populate incrementally.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferring to a follow-up. Process-group kill on timeout + bounded stdout streaming is real hardening, but it's orthogonal to the PA migration — the same behavior existed pre-migration (this is a verbatim port of the existing bash execution). Keeping PR #2 scoped to the substrate swap; the CE single-tenant trust model (the operator's own machine) makes it non-urgent. Flagging for the backlog rather than expanding this PR.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@galanko, understood — the rationale is solid: this is a verbatim port, the CE trust model reduces urgency, and keeping #2 scoped to the substrate swap is the right call.

Would you like me to open a GitHub issue to track process-group kill on timeout + bounded stdout streaming for the bash tool? I can include a backlink to this PR and comment so it's easy to pick up later.

Comment thread backend/cliff/agents/runtime/tools/edit.py Outdated
Comment thread backend/cliff/agents/runtime/tools/mcp.py
Comment on lines +85 to +103
command = cfg["command"]
if not command:
raise ValueError("local MCP config has empty command")
with _suppress_mcp_v2_deprecation():
return MCPServerStdio(
command=command[0],
args=list(command[1:]),
env=cfg.get("environment") or None,
tool_prefix=entry_id,
)

if server_type == "remote":
url = cfg["url"]
with _suppress_mcp_v2_deprecation():
return MCPServerStreamableHTTP(
url=url,
headers=cfg.get("headers") or None,
tool_prefix=entry_id,
)
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate MCP config value types before constructing toolsets.

command and url are used without type checks. A malformed command string is currently treated as a char array and produces an invalid tool invocation instead of being skipped as malformed.

Proposed fix
 def _build_one(entry_id: str, cfg: dict[str, Any]) -> AbstractToolset[Any] | None:
     server_type = cfg.get("type", "local")

     if server_type == "local":
         command = cfg["command"]
-        if not command:
+        if (
+            not isinstance(command, (list, tuple))
+            or not command
+            or not all(isinstance(part, str) and part for part in command)
+        ):
             raise ValueError("local MCP config has empty command")
         with _suppress_mcp_v2_deprecation():
             return MCPServerStdio(
                 command=command[0],
                 args=list(command[1:]),
                 env=cfg.get("environment") or None,
                 tool_prefix=entry_id,
             )

     if server_type == "remote":
         url = cfg["url"]
+        if not isinstance(url, str) or not url.strip():
+            raise ValueError("remote MCP config has invalid url")
         with _suppress_mcp_v2_deprecation():
             return MCPServerStreamableHTTP(
                 url=url,
                 headers=cfg.get("headers") or None,
                 tool_prefix=entry_id,
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/cliff/agents/runtime/tools/mcp.py` around lines 85 - 103, Validate
the MCP config types before constructing MCPServerStdio or
MCPServerStreamableHTTP: ensure `command` is a non-empty list/tuple of strings
(not a single string) before using command[0] and args, and ensure `url` is a
string before constructing the remote server; if the values are malformed, raise
a ValueError with a clear message (including `entry_id`) so callers can skip the
malformed toolset instead of invoking an invalid tool.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 9706d20 addressed this comment by adding explicit type and non-empty checks for the local MCP command config before using command[0], preventing malformed commands from being treated as char sequences even though remote URL validation remains to be added.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipped: comment is from another GitHub bot.

Comment thread backend/tests/agents/tools/test_tools.py Outdated
Comment on lines +91 to +93
def _make_flaky_run(
error_attempts: int, *, status_code: int = 429, always_error: bool = False
):
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Type the new helper callables explicitly.

The new flaky-run/backoff helpers should have explicit parameter and return types for strict backend typing compliance.

Suggested typing for new helpers
@@
 from __future__ import annotations
 
 import json
 import time
 from pathlib import Path  # noqa: TC003 — runtime use in fixture annotation
+from collections.abc import Awaitable, Callable
+from typing import Any
@@
 def _make_flaky_run(
     error_attempts: int, *, status_code: int = 429, always_error: bool = False
-):
+) -> tuple[Callable[[str, Any, Any], Awaitable[dict[str, Any]]], dict[str, int]]:
@@
-    async def _run(agent_type, deps, model):  # noqa: ARG001
+    async def _run(agent_type: str, deps: Any, model: Any) -> dict[str, Any]:  # noqa: ARG001
@@
-def _zero_backoff(monkeypatch):
+def _zero_backoff(monkeypatch: pytest.MonkeyPatch) -> None:

As per coding guidelines, backend/**/*.py: "Use Python 3.11+ with strict type hints and Pydantic models for backend code".

Also applies to: 102-103, 113-116

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/tests/integration/test_rate_limit_backoff.py` around lines 91 - 93,
The helper factory _make_flaky_run should declare full type hints: annotate
parameters (error_attempts: int, status_code: int = 429, always_error: bool =
False) and specify a precise return type (e.g., Callable[..., Coroutine[Any,
Any, ReturnType]] or Callable[[...], Awaitable[ResponseType]] depending on what
the helper returns); update the inner helper callables referenced around lines
102-103 and 113-116 similarly so every helper parameter and return is explicitly
typed (use typing.Callable, typing.Awaitable/Coroutine, and concrete
response/result types from the test helpers or HTTP client types) to satisfy
strict Python 3.11 backend typing rules.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declining. This repo's ruff config doesn't enable the ANN ruleset (select = ["E","F","I","N","UP","B","SIM","TCH"]), and the existing test helpers/fixtures are untyped by convention. Annotating only the new helpers would be inconsistent with the surrounding suite. Enforced test typing would be a repo-wide convention change (enable ANN + backfill), which is a separate cleanup — out of scope for this migration PR.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood — the ruff ANN ruleset is not enabled, test helpers are intentionally left untyped, and a partial annotation would break consistency. I'll note this for future reviews.

Would you like me to resolve this comment?

(´• ω •`) 🐇


✏️ Learnings added
Learnt from: galanko
Repo: cliff-security/cliff PR: 251
File: backend/tests/integration/test_rate_limit_backoff.py:91-93
Timestamp: 2026-06-01T14:30:39.014Z
Learning: In the `cliff-security/cliff` repository, the ruff linter configuration uses `select = ["E","F","I","N","UP","B","SIM","TCH"]` — the `ANN` (annotation) ruleset is intentionally excluded. Test helpers and fixtures are untyped by convention; adding type annotations only to individual new test helpers is considered inconsistent with the surrounding suite. Enforcing test-level type annotations would be a repo-wide cleanup requiring `ANN` enablement plus a full backfill — do not flag missing type hints on test helper functions in this repo.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment thread backend/tests/test_executor.py
Comment on lines +709 to +714
def _permission_route_patches(*, pending: bool):
"""Patch the route's get_agent_run / get_workspace / env-resolver.

resp = await client.post(
"/api/workspaces/ws-1/agent-runs/run-1/permission",
json={"approved": True},
)

assert resp.status_code == 200
body = resp.json()
assert body["status"] == "approved"
assert body["agent_run_id"] == "run-1"

@pytest.mark.asyncio
async def test_deny_calls_executor_deny_tool(self, app, client):
executor = app.state.agent_executor
executor.deny_tool = lambda run_id: True

resp = await client.post(
"/api/workspaces/ws-1/agent-runs/run-1/permission",
json={"approved": False},
)

assert resp.status_code == 200
body = resp.json()
assert body["status"] == "denied"
assert body["agent_run_id"] == "run-1"

@pytest.mark.asyncio
async def test_no_pending_returns_404(self, app, client):
executor = app.state.agent_executor
executor.approve_tool = lambda run_id: False
executor.deny_tool = lambda run_id: False

resp = await client.post(
"/api/workspaces/ws-1/agent-runs/gone/permission",
json={"approved": True},
)

assert resp.status_code == 404
assert "No pending permission request" in resp.json()["detail"]

@pytest.mark.asyncio
async def test_approve_routed_to_approve_not_deny(self, app, client):
"""Trust guard — make sure ``approved=true`` doesn't accidentally
wire to deny_tool. Regression catcher for the conditional in
``respond_to_permission``."""
approve_calls = []
deny_calls = []

executor = app.state.agent_executor
executor.approve_tool = lambda run_id: approve_calls.append(run_id) or True
executor.deny_tool = lambda run_id: deny_calls.append(run_id) or True

resp = await client.post(
"/api/workspaces/ws-1/agent-runs/run-77/permission",
json={"approved": True},
)
``pending`` controls whether the agent_run reports a pending permission
request (the 404 gate keys off it).
"""
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add strict type hints to the newly introduced backend test helpers.

The new helper and async test signatures should be explicitly typed to meet the backend strict-typing requirement.

Type-hint pass for new signatures
@@
-def _permission_route_patches(*, pending: bool):
+def _permission_route_patches(*, pending: bool) -> tuple[Any, Any, Any]:
@@
-    async def test_approve_resumes_executor(self, app, client):
+    async def test_approve_resumes_executor(
+        self, app: FastAPI, client: AsyncClient
+    ) -> None:
@@
-        async def _resume(db, ws, rid, *, approved, workspace_dir,
-                          deny_message=None, env_vars=None):
+        async def _resume(
+            db: Any,
+            ws: Any,
+            rid: str,
+            *,
+            approved: bool,
+            workspace_dir: str,
+            deny_message: str | None = None,
+            env_vars: dict[str, str] | None = None,
+        ) -> None:
@@
-    async def test_deny_resumes_executor_with_denied(self, app, client):
+    async def test_deny_resumes_executor_with_denied(
+        self, app: FastAPI, client: AsyncClient
+    ) -> None:
@@
-    async def test_no_pending_returns_404(self, app, client):
+    async def test_no_pending_returns_404(
+        self, app: FastAPI, client: AsyncClient
+    ) -> None:

As per coding guidelines, backend/**/*.py: "Use Python 3.11+ with strict type hints and Pydantic models for backend code".

Also applies to: 746-752, 775-780, 801-801

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/tests/test_routes_agent_execution.py` around lines 709 - 714,
Annotate the new helper and the async test signatures with explicit type hints:
change def _permission_route_patches(*, pending: bool) to include a return type
(for example def _permission_route_patches(*, pending: bool) -> Iterator[None]
and import Iterator from typing, or use ContextManager[Any] if it yields a
context manager), and ensure every async test function in this file has an
explicit return annotation (e.g. async def test_xyz(...) -> None). Also add
missing parameter type annotations for any other params referenced in those
helpers (e.g. any mocked objects) so all new helper and test signatures are
fully typed per backend strict-typing rules.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declining. This repo's ruff config doesn't enable the ANN ruleset (select = ["E","F","I","N","UP","B","SIM","TCH"]), and the existing test helpers/fixtures are untyped by convention. Annotating only the new helpers would be inconsistent with the surrounding suite. Enforced test typing would be a repo-wide convention change (enable ANN + backfill), which is a separate cleanup — out of scope for this migration PR.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@galanko, understood — that's a fair call. Ruff not enabling ANN makes partial annotation noise rather than signal, and consistency with the existing untyped test suite is the right call here.

(°–°) 🐇


✏️ Learnings added
Learnt from: galanko
Repo: cliff-security/cliff PR: 251
File: backend/tests/test_routes_agent_execution.py:0-0
Timestamp: 2026-06-01T14:30:37.773Z
Learning: In the cliff-security/cliff repository, test helpers and pytest fixtures are intentionally untyped by convention. The ruff config (backend/pyproject.toml) uses `select = ["E","F","I","N","UP","B","SIM","TCH"]` and does not enable the `ANN` ruleset. Suggesting type annotations on new test helpers/fixtures should be avoided, as it would be inconsistent with the surrounding suite. Enabling `ANN` and backfilling annotations is a separate repo-wide cleanup effort, out of scope for individual feature PRs.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment thread backend/tests/test_routes_agent_execution.py Outdated
@galanko
Copy link
Copy Markdown
Collaborator Author

galanko commented Jun 1, 2026

baz review

Comment on lines +97 to +101
def classify_tool_request(tool: str, patterns: list[str]) -> str:
"""Return ``"auto"``, ``"ask"``, or ``"deny"`` for an executor tool call.

Ported verbatim from ``executor._classify_tool_request``.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

classify_tool_request duplicates executor._classify_tool_request, so should we keep a single helper and have executor/tests import it to avoid drift when the deny-list or ask paths change?

Severity

Want Baz to fix this for you? Activate Fixer

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 9706d20 addressed this comment by removing the executor-specific _classify_tool_request and now keeping the deny/ask/denylist classifier only in backend/cliff/agents/runtime/tools/permissions.py, which is used by the new gate_tool_call enforcement. The old OpenCode permission queue path (and its duplicated classifier) was replaced with the Pydantic AI executor path, so there is no longer a second copy to drift apart.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional and temporary during the migration — called out in the permissions.py module docstring. The OpenCode-era executor._classify_tool_request is deleted in PR #3 of this stack (chore/drop-opencode-substrate), at which point permissions.classify_tool_request is the single source. Both copies are byte-identical and covered by the same tests until then, so there's no drift window in practice. Consolidating now would mean importing the new module back into code that's about to be deleted.

Comment thread backend/cliff/agents/executor.py Outdated
Comment on lines +655 to +658
approvals = reqs.approvals
tool_call_ids = [p.tool_call_id for p in approvals]
primary = approvals[0]
meta = reqs.metadata.get(primary.tool_call_id, {}) if reqs.metadata else {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeferredToolRequests.approvals can be empty, so approvals[0] may raise IndexError and skip _park_for_permission(); should we handle the empty-list case and build the marker from whatever tool-call data is present instead of indexing the first approval?

Severity web_search

Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
backend/cliff/agents/executor.py around lines 651-663, in the
`_build_permission_marker(reqs: DeferredToolRequests)` logic, stop unconditionally doing
`primary = approvals[0]` and handling `approvals` as if it is always non-empty. Refactor
`_build_permission_marker` to explicitly handle the `approvals == []` case by building a
fallback `permission_request` marker without indexing (prefer inferring `tool_call_ids`
from `reqs.metadata` when available; otherwise use an empty `tool_call_ids` list and
safe defaults like `tool="unknown"` and `patterns=[]`). This ensures `_run_pa_executor`
can return `_PaExecutorOutcome(permission_request=...)` and `_park_for_permission()`
persists `permission_pending`/`pa_message_history`, so the run ends up in
`awaiting_permission` rather than failing with `IndexError`.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 9706d20 addressed this comment by refactoring _build_permission_marker to guard when DeferredToolRequests.approvals is empty, avoiding approvals[0] and instead returning a safe fallback marker (with default tool, patterns, and inferred tool_call_ids) so the permission flow can pause without crashing.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9706d20. _build_permission_marker no longer indexes unconditionally — it guards the empty-approvals case and builds a fallback marker (tool='unknown', patterns=[], tool_call_ids=[]) so an approvals-empty DeferredToolRequests parks cleanly instead of raising IndexError. In practice Cliff's gated tools always raise ApprovalRequired (→ non-empty approvals); the guard covers PA's external-calls mechanism, which Cliff doesn't use. Covered by two new unit tests in test_deferred_tools_persist.py.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update—guarding _build_permission_marker when there are no approvals and adding the fallback marker keeps the pause/resume flow safe even for PA’s external pause API.

Comment on lines 515 to +526
run_id: str,
body: PermissionDecision,
request: Request,
db=Depends(get_db),
):
"""Approve or deny a pending tool-use permission request."""
"""Approve or deny a paused executor tool call, resuming the run.

ADR-0047 / PR #2 — the remediation_executor pauses on a gated tool by
persisting a ``DeferredToolRequests`` marker (``agent_run.permission_
request`` + ``pa_message_history``). This endpoint resumes the run via
``executor.resume_executor`` with the user's decision. Resume actually
re-runs the agent (it can take minutes), so it runs as a background
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we open a fresh DB session inside _resume_in_background instead of closing over db=Depends(get_db), otherwise the request-scoped connection is gone by the time executor.resume_executor writes and the background task can hit ProgrammingError/DatabaseError?

Severity

Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
backend/cliff/api/routes/agent_execution.py around lines 515-561, in the
`permission_decision` POST handler, you create an asyncio background task
(`_resume_in_background`) that closes over the request-scoped `db` dependency
(`db=Depends(get_db)`). Refactor so the background task does NOT use the request’s
`db` session; instead, open a fresh DB session inside `_resume_in_background` (e.g.,
call `get_db` again / use a session factory) and ensure it’s closed after
`executor.resume_executor` completes. This guarantees the resume/update logic runs with
a valid connection even after the HTTP response returns.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not applicable to Cliff's connection model. get_db yields a single process-global aiosqlite connection opened once at init_db() and closed only at shutdown (close_db()) — there's no per-request session that gets torn down when the response returns (see cliff/db/connection.py: the dependency just yield _dbs the shared singleton with no teardown). So the closure in _resume_in_background stays valid for the life of the process. This matches the existing background-task pattern in execute/run-all, which also close over db.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense given Cliff’s process-global aiosqlite connection model. I’ll save this to memory once the PR is merged.

Comment on lines +532 to +540
run = await get_agent_run(db, run_id)
if run is None or not run.permission_pending:
raise HTTPException(
status_code=404,
detail="No pending permission request for this agent run",
)
workspace = await get_workspace(db, workspace_id)
if workspace is None or not workspace.workspace_dir:
raise HTTPException(status_code=404, detail="Workspace not found")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

respond_to_permission should verify run.workspace_id == workspace_id before executor.resume_executor, otherwise a paused run can resume in the wrong workspace context — should we add the same ownership guard as cancel_agent_run?

Severity

Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
backend/cliff/api/routes/agent_execution.py around lines 532-540 inside the
`respond_to_permission` endpoint, add a workspace/run ownership guard: after `run =
await get_agent_run(db, run_id)` and `workspace = await get_workspace(db,
workspace_id)`, verify that `run.workspace_id == workspace_id` (or the equivalent field
on the run model). If they don’t match, raise an HTTPException (404 is consistent with
the other “not found pending request” behavior) and do not schedule
`_resume_in_background`. This prevents a client from resuming a paused run from a
different workspace with the wrong `workspace_dir`/env vars.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 9706d20 addressed this comment by adding a workspace ownership guard in respond_to_permission, ensuring run.workspace_id == workspace_id before resuming the executor and returning 404 when the run belongs to another workspace.

Comment on lines +552 to 556
deny_message=body.deny_message,
env_vars=env_vars,
)
except Exception:
logger.exception(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GET /workspaces/{workspace_id}/agent-execution/stream looks like the only SSE path for agent_run_started/completed, permission_request, and done; should we keep it behind a deprecation window or document the agent-runs replacement in the public API/changelog for external EventSource consumers?

Severity

Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
`backend/cliff/api/routes/agent_execution.py` around lines 552-648 where the
`stream_agent_execution` GET SSE endpoint
(`/workspaces/{workspace_id}/agent-execution/stream`) was removed, re-add a
compatibility shim instead of dropping the only channel that emits `agent_run_started`,
`agent_run_completed`, `permission_request`, and `done`. Implement the endpoint again
(or keep it behind a short deprecation window/feature flag) so existing EventSource
clients still receive the same event shapes, and if it will be removed later, return a
clear `Deprecation`/`Sunset`-style response header or an explicit message pointing to
the polling replacement (`agent-runs`). Also add a public migration note in the API
changelog/docs mentioning the deprecation window and the replacement path for external
SSE consumers.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't-fix for the community edition. Cliff CE is single-user and self-hosted — the React frontend is the sole consumer of this endpoint, and it's switched to agent-runs polling in this same PR (PR2.D). There are no external EventSource clients to keep a deprecation window for, so the channel is removed rather than sunset. The SSE removal + polling replacement will be noted in the release CHANGELOG. A hardened multi-tenant deployment with third-party SSE consumers would warrant a deprecation window, but that's out of scope here.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense for CE. I'll save this to memory once the PR is merged.

galanko and others added 2 commits June 1, 2026 17:24
Security / correctness:
- read tool: refuse paths resolving outside the workspace (was reading
  arbitrary host files); add shared escapes_workspace() helper in
  permissions.py, used by both read and edit.
- edit tool: fix relative_to() crash on an approved escaping write —
  fall back to absolute path in the confirmation message.
- webfetch: stream the response, enforce the content-type allowlist from
  headers before reading any body, and stop at _MAX_BODY_BYTES so a huge
  or binary body can't blow memory/context.
- agent-execution permission route: add workspace-ownership check so a
  run parked in another workspace can't be resumed via this workspace's
  URL (and with its env).
- mcp adapter: reject a non-list `command` (a string would be split into
  characters) with a clear TypeError; skipped, not mangled.
- executor: guard the typed-Optional parse_result so a contract violation
  surfaces as a failed run, not an AttributeError in _finalize_run.

Tests:
- edit: split the misnamed "write outside" test into a real
  approved-escaping-write case (exercises the relative_to fallback) and
  an inside-write case.
- read: refuse-escape test (symlink) asserting host contents don't leak.
- mcp: string-command-skipped test.
- webfetch: oversize-body truncation test.
- permission route: replace sleep(0.02) with a deterministic asyncio.Event
  wait; add cross-workspace-ownership 404 test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
baz-reviewer flagged that `approvals[0]` would IndexError if a
DeferredToolRequests arrives with an empty `approvals` list. Cliff's gated
tools always raise ApprovalRequired (→ non-empty approvals), but PA also
models external deferred `calls` (a mechanism Cliff doesn't use); guard
the empty case so the run parks cleanly with a fallback marker instead of
crashing out of the dispatch path.

Tests: marker built from a real approval; marker from an empty
DeferredToolRequests returns safe defaults without indexing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@galanko
Copy link
Copy Markdown
Collaborator Author

galanko commented Jun 1, 2026

Review round addressed — 9706d20

Thanks both bots. Summary of how each finding was handled.

Fixed (commits f546349 + 9706d20)

Finding Resolution
read reads arbitrary host files Refuse paths resolving outside the workspace; shared escapes_workspace() helper
edit relative_to() crash on approved escaping write Fall back to absolute path in the confirmation message
webfetch buffers full body before checks Stream; enforce content-type from headers first; stop at the byte cap
permission route missing ownership check run.workspace_id != workspace_id → 404
mcp command not validated Reject non-list command with a clear TypeError; Iterator[None] annotation
executor unguarded parse_result Guard the typed Optional → failed run, not AttributeError
_build_permission_marker approvals[0] IndexError Guard empty-approvals; fallback marker
edit "approved-outside" test didn't exercise the path Split into real escaping-write + inside-write cases
sleep-based test sync Deterministic asyncio.Event

New tests: read refuse-escape (symlink, no host-content leak), mcp string-command-skipped, webfetch oversize-truncation, cross-workspace-ownership 404, _build_permission_marker normal + empty. 93 affected tests pass, ruff clean.

Declined / deferred (rationale on each thread)

  • SSE deprecation window — CE is single-user self-hosted; the frontend is the sole consumer and was migrated to agent-runs polling in this PR. No external EventSource clients to sunset for. Noted for the CHANGELOG.
  • Fresh DB session in the background taskget_db yields a process-global singleton connection (closed only at shutdown), so the closure stays valid; matches the existing execute/run-all pattern.
  • awaiting_permission in AgentRunStatus — never written to the DB as a status; it's only the in-memory AgentExecutionResult.status. The parked row stays running + permission_pending.
  • Duplicate classifier in executor.py — intentional during the migration; the OpenCode copy is deleted in PR Switch license from MIT to AGPL-3.0 for open-core model #3 of this stack. Documented in the permissions.py docstring.
  • bash process-group kill + bounded streaming — real hardening but orthogonal to the migration (verbatim port of existing behavior); backlog follow-up.
  • Strict type hints on test helpersANN isn't in the repo's ruff select; test helpers are untyped by convention. A repo-wide typing change is separate cleanup.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/tests/agents/tools/test_tools.py (1)

20-27: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider adding type hints to the _ctx helper.

The workspace_dir and env_vars parameters lack type annotations. While test helpers often have relaxed typing, adding hints improves maintainability.

-def _ctx(workspace_dir, *, approved: bool = False, env_vars=None) -> SimpleNamespace:
+def _ctx(
+    workspace_dir: str | Path,
+    *,
+    approved: bool = False,
+    env_vars: dict[str, str] | None = None,
+) -> SimpleNamespace:

As per coding guidelines, "Use Python 3.11+ with strict type hints and Pydantic models for backend code".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/tests/agents/tools/test_tools.py` around lines 20 - 27, Add Python
type hints to the _ctx helper: annotate workspace_dir as pathlib.Path | str (or
simply Path) and env_vars as dict[str, str] | None, and annotate the return type
as SimpleNamespace. Update the function signature for _ctx and add any necessary
imports (e.g., from pathlib import Path and from typing import Optional, Dict)
so the signature reads something like _ctx(workspace_dir: Path | str, *,
approved: bool = False, env_vars: Optional[Dict[str, str]] = None) ->
SimpleNamespace, keeping the existing body and behavior unchanged and
referencing the same WorkspaceDeps construction.
backend/cliff/agents/runtime/tools/mcp.py (1)

103-110: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate url type before constructing MCPServerStreamableHTTP.

The past review flagged that url should be validated as a non-empty string, but that check was not added. A malformed url (e.g., None, empty string, or wrong type) will produce an invalid remote toolset or raise downstream.

🛡️ Proposed fix to add url validation
     if server_type == "remote":
         url = cfg["url"]
+        if not isinstance(url, str) or not url.strip():
+            raise ValueError(
+                f"remote MCP config 'url' must be a non-empty string, got "
+                f"{type(url).__name__}"
+            )
         with _suppress_mcp_v2_deprecation():
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/cliff/agents/runtime/tools/mcp.py` around lines 103 - 110, When
server_type == "remote" validate cfg["url"] is a non-empty string before calling
MCPServerStreamableHTTP: retrieve url = cfg["url"], check isinstance(url, str)
and url.strip() is not empty, and if the check fails raise a clear ValueError
(or log and raise) indicating an invalid remote url for entry_id; only then call
_suppress_mcp_v2_deprecation() and construct MCPServerStreamableHTTP(url=url,
headers=cfg.get("headers") or None, tool_prefix=entry_id).
♻️ Duplicate comments (3)
backend/tests/test_routes_agent_execution.py (2)

746-848: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add strict return type annotations to async test methods.

The async test methods in TestPermissionEndpoint lack explicit -> None return annotations. As per coding guidelines, backend code must use Python 3.11+ with strict type hints.

♻️ Proposed fix to add return annotations
     `@pytest.mark.asyncio`
-    async def test_approve_resumes_executor(self, app, client):
+    async def test_approve_resumes_executor(
+        self, app: FastAPI, client: AsyncClient
+    ) -> None:

     `@pytest.mark.asyncio`
-    async def test_deny_resumes_executor_with_denied(self, app, client):
+    async def test_deny_resumes_executor_with_denied(
+        self, app: FastAPI, client: AsyncClient
+    ) -> None:

     `@pytest.mark.asyncio`
-    async def test_no_pending_returns_404(self, app, client):
+    async def test_no_pending_returns_404(
+        self, app: FastAPI, client: AsyncClient
+    ) -> None:

     `@pytest.mark.asyncio`
-    async def test_run_owned_by_other_workspace_returns_404(self, app, client):
+    async def test_run_owned_by_other_workspace_returns_404(
+        self, app: FastAPI, client: AsyncClient
+    ) -> None:

As per coding guidelines, backend/**/*.py: Use Python 3.11+ with strict type hints and Pydantic models for backend code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/tests/test_routes_agent_execution.py` around lines 746 - 848, The
async test methods in TestPermissionEndpoint (e.g.,
test_approve_resumes_executor, test_deny_resumes_executor_with_denied,
test_no_pending_returns_404, test_run_owned_by_other_workspace_returns_404) are
missing explicit return type annotations; update each async def to include ->
None (e.g., async def test_approve_resumes_executor(self, app, client) -> None:)
to satisfy the strict Python 3.11+ typing guideline throughout the test class.

709-743: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add strict type hints to the permission route helper.

The helper _permission_route_patches lacks an explicit return type annotation. As per coding guidelines, backend code must use Python 3.11+ with strict type hints.

♻️ Proposed fix to add return type
+from typing import Iterator
+from unittest.mock import _patch

-def _permission_route_patches(*, pending: bool, run_workspace_id: str = "ws-1"):
+def _permission_route_patches(
+    *, pending: bool, run_workspace_id: str = "ws-1"
+) -> tuple[_patch[Any], _patch[Any], _patch[Any]]:

As per coding guidelines, backend/**/*.py: Use Python 3.11+ with strict type hints and Pydantic models for backend code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/tests/test_routes_agent_execution.py` around lines 709 - 743, The
helper _permission_route_patches has no return type annotation; add a strict
type hint on its signature like def _permission_route_patches(*, pending: bool,
run_workspace_id: str = "ws-1") -> tuple[object, object, object]: to explicitly
declare it returns a 3-tuple of patch objects (or replace object with a more
specific mock type if you prefer), and keep the body unchanged; this satisfies
the backend strict typing rule while preserving references to AgentRun,
SimpleNamespace, AsyncMock and patch used inside the function.
backend/cliff/agents/executor.py (1)

1636-1649: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard outcome.parse_result before calling _finalize_run in resume path.

The resume_executor path calls _finalize_run with outcome.parse_result (line 1644) without verifying it is not None. The _PaExecutorOutcome contract allows parse_result=None, and a contract violation would crash _finalize_run with an AttributeError instead of surfacing as a failed run. The execute() path added this guard (lines 851-855), but the resume path is still missing it.

🐛 Proposed fix to add guard
             if outcome.permission_request is not None:
                 return await self._park_for_permission(
                     db, workspace_id, agent_run, outcome, start_time
                 )
+            if outcome.parse_result is None:
+                raise AgentProcessError(
+                    "executor outcome missing parse_result for completed resume"
+                )
             return await self._finalize_run(
                 db,
                 workspace_id=workspace_id,
                 agent_type=agent_run.agent_type,
                 parse_result=outcome.parse_result,

As per coding guidelines, backend/**/*.py: Use Python 3.11+ with strict type hints and Pydantic models for backend code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/cliff/agents/executor.py` around lines 1636 - 1649, The resume path
calls _finalize_run with outcome.parse_result without checking for None; update
the resume_executor flow so that before calling _finalize_run (the branch that
currently mirrors lines 1636-1649), you guard outcome.parse_result the same way
as in execute()—if outcome.parse_result is None, treat the run as failed/handle
appropriately (e.g., return a failed run result or raise a controlled error)
instead of passing None into _finalize_run; reference outcome.parse_result,
_finalize_run, and the resume path that currently calls _park_for_permission to
locate where to add the guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@backend/cliff/agents/runtime/tools/mcp.py`:
- Around line 103-110: When server_type == "remote" validate cfg["url"] is a
non-empty string before calling MCPServerStreamableHTTP: retrieve url =
cfg["url"], check isinstance(url, str) and url.strip() is not empty, and if the
check fails raise a clear ValueError (or log and raise) indicating an invalid
remote url for entry_id; only then call _suppress_mcp_v2_deprecation() and
construct MCPServerStreamableHTTP(url=url, headers=cfg.get("headers") or None,
tool_prefix=entry_id).

In `@backend/tests/agents/tools/test_tools.py`:
- Around line 20-27: Add Python type hints to the _ctx helper: annotate
workspace_dir as pathlib.Path | str (or simply Path) and env_vars as dict[str,
str] | None, and annotate the return type as SimpleNamespace. Update the
function signature for _ctx and add any necessary imports (e.g., from pathlib
import Path and from typing import Optional, Dict) so the signature reads
something like _ctx(workspace_dir: Path | str, *, approved: bool = False,
env_vars: Optional[Dict[str, str]] = None) -> SimpleNamespace, keeping the
existing body and behavior unchanged and referencing the same WorkspaceDeps
construction.

---

Duplicate comments:
In `@backend/cliff/agents/executor.py`:
- Around line 1636-1649: The resume path calls _finalize_run with
outcome.parse_result without checking for None; update the resume_executor flow
so that before calling _finalize_run (the branch that currently mirrors lines
1636-1649), you guard outcome.parse_result the same way as in execute()—if
outcome.parse_result is None, treat the run as failed/handle appropriately
(e.g., return a failed run result or raise a controlled error) instead of
passing None into _finalize_run; reference outcome.parse_result, _finalize_run,
and the resume path that currently calls _park_for_permission to locate where to
add the guard.

In `@backend/tests/test_routes_agent_execution.py`:
- Around line 746-848: The async test methods in TestPermissionEndpoint (e.g.,
test_approve_resumes_executor, test_deny_resumes_executor_with_denied,
test_no_pending_returns_404, test_run_owned_by_other_workspace_returns_404) are
missing explicit return type annotations; update each async def to include ->
None (e.g., async def test_approve_resumes_executor(self, app, client) -> None:)
to satisfy the strict Python 3.11+ typing guideline throughout the test class.
- Around line 709-743: The helper _permission_route_patches has no return type
annotation; add a strict type hint on its signature like def
_permission_route_patches(*, pending: bool, run_workspace_id: str = "ws-1") ->
tuple[object, object, object]: to explicitly declare it returns a 3-tuple of
patch objects (or replace object with a more specific mock type if you prefer),
and keep the body unchanged; this satisfies the backend strict typing rule while
preserving references to AgentRun, SimpleNamespace, AsyncMock and patch used
inside the function.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: f50b62be-69c0-463d-bca7-116d1d15ff26

📥 Commits

Reviewing files that changed from the base of the PR and between c25f096 and 9706d20.

📒 Files selected for processing (11)
  • backend/cliff/agents/executor.py
  • backend/cliff/agents/runtime/tools/edit.py
  • backend/cliff/agents/runtime/tools/mcp.py
  • backend/cliff/agents/runtime/tools/permissions.py
  • backend/cliff/agents/runtime/tools/read.py
  • backend/cliff/agents/runtime/tools/webfetch.py
  • backend/cliff/api/routes/agent_execution.py
  • backend/tests/agents/test_deferred_tools_persist.py
  • backend/tests/agents/tools/test_mcp.py
  • backend/tests/agents/tools/test_tools.py
  • backend/tests/test_routes_agent_execution.py

@galanko galanko merged commit eb1a95e into main Jun 1, 2026
8 checks passed
@galanko galanko deleted the feat/pa-migrate-executor branch June 1, 2026 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant