feat(llm): add Claude Agent SDK adapter behind agent.runtime flag (migration, phase 1) by PalmPalm7 · Pull Request #24 · rhpds/parsec

PalmPalm7 · 2026-06-01T09:30:27Z

Second of the two phase-1 migration PRs. Additive, behind a feature flag that defaults to today's behavior; independent of the skills-loader PR (#23) and mergeable in any order.

What this does

Adds src/llm/ — a thin async adapter (AgentSdkClient) around claude_agent_sdk.query(), selected by a new agent.runtime: legacy | sdk flag that defaults to legacy (today's raw-messages.create() path). claude_agent_sdk is imported lazily, so the module ships fine without the SDK installed; only .complete() triggers the import and raises a clear AgentSdkUnavailableError if it's missing.

No behavior change: nothing calls the adapter yet — the orchestrator is untouched. This is the tested runtime scaffolding the orchestrator integration (a later phase) will build on, landed separately so it can be reviewed in isolation and can't affect production while flagged off.

What changed (6 files, +815; 2 commits)

Area	File	Lines	Purpose
Adapter	`src/llm/agent_sdk_client.py`	+293	`AgentSdkClient.complete()` — runs one `query()`, aggregates the streamed `AssistantMessage`/`UserMessage`/`ResultMessage` flow into `SdkResult`
Types	`src/llm/types.py`	+44	`SdkResult`, `SdkUsage` (frozen)
Runtime flag	`src/llm/runtime.py`	+43	`get_runtime(config)` → `legacy`\|`sdk`, unknown values coerced to `legacy`
API	`src/llm/__init__.py`	+39	public surface
Config	`config/config.yaml`	+14	new `agent:` section (runtime flag + sdk overrides)
Tests	`tests/test_agent_sdk_client.py`	+382	19 tests (fake SDK injected via `sys.modules`)

Commits: db7c8b7 (adapter) and af6d243 (fix — see below).

How to test — local

git fetch origin && git checkout migration/agent-sdk-adapter
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install black ruff mypy types-PyYAML

pytest tests/test_agent_sdk_client.py -v      # 19 tests; fake SDK, no network, no SDK install needed
black --check --line-length=100 src/
ruff check src/
mypy src/ --ignore-missing-imports

Optional — a real one-shot call (needs the SDK + an Anthropic key):

pip install claude-agent-sdk && export ANTHROPIC_API_KEY=sk-ant-...
python3 - <<'PY'
import asyncio
from src.llm import AgentSdkClient
from src.llm.agent_sdk_client import AgentSdkConfig
async def main():
    r = await AgentSdkClient(AgentSdkConfig(model="claude-sonnet-4-6")).complete(prompt="say hi in 3 words")
    print(r.succeeded, repr(r.text), r.model, r.usage.total_cost_usd)
asyncio.run(main())
PY

Result — local

pytest → 19 passed (click to expand)

$ pytest tests/test_agent_sdk_client.py -q
...................                                                      [100%]
19 passed in 0.15s

Covers: runtime flag (default, valid, unknown→legacy, Dynaconf-like); from_config (anthropic defaults, sdk overrides, empty config); complete() (lazy-import error path, text aggregation, tool_use↔tool_result pairing, error capture, options pass-through, omitted optionals, per-call max_turns); model/session captured from AssistantMessage; SdkResult.succeeded.

quality gates → all clean

$ ruff check src/llm tests/test_agent_sdk_client.py
All checks passed!
$ black --check src/llm tests/test_agent_sdk_client.py
5 files would be left unchanged.
$ mypy src/llm
Success: no issues found in 4 source files

bug caught + fixed during live API testing (commit af6d243)

A live call against the real Anthropic API failed an assertion: result.model came back None. Root cause — the real claude_agent_sdk ResultMessage has no model field (it carries usage/cost); the model lives on AssistantMessage. The adapter now captures model/session_id from AssistantMessage during the stream, with a regression test (test_complete_captures_model_from_assistant_even_when_result_has_none). This is exactly the kind of gap pure-mock testing misses — worth knowing the adapter has now been corrected against the real SDK shape.

How to test — staging (real OpenShift cluster, live company Vertex)

The standard Parsec image can't exercise this adapter: claude_agent_sdk shells out to the claude CLI (Node), which the image doesn't bundle, and nothing in the orchestrator calls the adapter yet. So a purpose-built image (this branch + Node + the claude CLI + claude-agent-sdk) was run as a one-off Job that calls AgentSdkClient.complete() against company Vertex AI (itpc-gcp-octo-eng-claude / global), using a GCP ADC mounted as a Secret.

# essence (full bundle available on request):
#  - build image: FROM ubi9/python-311 + dnf nodejs + npm i -g @anthropic-ai/claude-code
#                 + pip install claude-agent-sdk anthropic[vertex] + clone this branch
#  - Job env: CLAUDE_CODE_USE_VERTEX=1, ANTHROPIC_VERTEX_PROJECT_ID, CLOUD_ML_REGION=global,
#             GOOGLE_APPLICATION_CREDENTIALS=<mounted ADC>
#  - Job runs:
python3 -c "
import asyncio; from src.llm import AgentSdkClient
from src.llm.agent_sdk_client import AgentSdkConfig
print(asyncio.run(AgentSdkClient(AgentSdkConfig(model='claude-sonnet-4-5@20250929'))
      .complete(prompt='Reply with exactly: SDK ADAPTER OK. No tools.')))"

Result — staging

Job output → PASS, real Vertex answer (click to expand)

=== PR#2 adapter runtime test ===
model         : claude-sonnet-4-5@20250929
CLAUDE_CODE_USE_VERTEX=1
ANTHROPIC_VERTEX_PROJECT_ID=itpc-gcp-octo-eng-claude
CLOUD_ML_REGION=global
---
succeeded     : True
text          : 'SDK ADAPTER OK'
model         : claude-sonnet-4-5-20250929
session_id    : 142ccb80-95a9-4d28-a242-7aa44387a2fc
tool calls    : 0
usage         : in=10 out=68 cost=$0.093987 turns=1
---
RESULT: PASS

The SDK spawned the claude CLI, authenticated to Vertex, and the adapter correctly aggregated the stream into SdkResult. The returned claude-sonnet-4-5-20250929 (SDK normalizes the @date form) also re-confirms the af6d243 model-capture fix works through the real SDK path.

Cost data point for the migration discussion: that trivial call cost ~$0.094 — the SDK's system-prompt/tool-definition overhead vs. a bare messages.create(). Concrete input for the Phase-2 cost/parity benchmark (Risk #4 in the plan).

Notes for the reviewer

Default agent.runtime: legacy means this is dead, safe code until a future PR adds the orchestrator dispatcher — intentional, so it lands low-risk.
claude-agent-sdk is intentionally not in requirements.txt (lazy import) so the image/CI don't pull Node + the CLI yet; that arrives with the integration PR as an optional extra.
Same CI footnote as feat(skills): add SKILL.md loader + GET /api/skills (Agent SDK migration, phase 1) #23: pytest isn't in ci.yml today, so these 19 tests won't run at merge unless we add a step — glad to.

🤖 Generated with Claude Code

…flag Introduces src.llm package with: - AgentSdkClient: thin async wrapper around claude_agent_sdk.query() that aggregates the streamed AssistantMessage/UserMessage/ResultMessage flow into a single SdkResult (text + tool invocations + usage) - get_runtime() helper reading the new agent.runtime config flag (legacy | sdk), defaulting to legacy so existing deploys are unchanged - Lazy import of claude_agent_sdk: the adapter file is importable without the package installed; calls raise AgentSdkUnavailableError with install hint instead Tests inject a fake sdk module via sys.modules so coverage exercises the message-stream aggregation, options pass-through, error capture, and config defaults without requiring the real SDK in CI. No orchestrator changes — additive only. Phase 2 of the migration will wire AgentSdkClient into a new AgentRunner dispatcher that reads agent.runtime. See artifacts/parsec-agent-sdk-migration-plan.md.

Real-SDK regression caught during live smoke test against api.anthropic.com on 2026-05-21. The adapter was reading ResultMessage.model, but the actual claude_agent_sdk ResultMessage (>=0.2.x) has no model field at all — model lives on AssistantMessage. Result: SdkResult.model came back as None for every real call. Fix: - Capture model + session_id from AssistantMessage during the stream (last value wins, in case the conversation switches models). - ResultMessage still provides session_id as fallback + the authoritative usage/cost/num_turns. - Refactor the message loop into _ingest_assistant / _ingest_user / _ingest_result helpers to keep complete() under the mccabe complexity limit after the added branches. Added regression test test_complete_captures_model_from_assistant_even_when_result_has_none to lock in the new contract. 19 tests passing (was 18). All quality gates (ruff/black/mypy) clean.

rut31337

Review: Request Changes

The adapter design is solid — lazy import, feature flag defaulting to legacy, frozen result types, and 19 tests. Four issues to address:

1. High: Incorrect session_id fallback (`src/llm/agent_sdk_client.py:233-234`)

state["session_id"] = getattr(message, "session_id", None) or state["session_id"]

The or state["session_id"] fallback is semantically incorrect — per the SDK spec, ResultMessage.session_id is a required str field, not optional. This silently keeps a stale session_id if ResultMessage.session_id were somehow empty. Drop the fallback:

state["session_id"] = getattr(message, "session_id", None)

Or if defensive:

result_session = getattr(message, "session_id", None)
if result_session:
    state["session_id"] = result_session

2. High: Fragile model capture (`src/llm/agent_sdk_client.py:220-226`)

"Last assistant message wins" with no validation. If a future SDK version streams multiple AssistantMessages with different model values, the code silently overwrites. Should log a warning on inconsistency:

if msg_model:
    if state["model"] and state["model"] != msg_model:
        logger.warning("Model changed mid-conversation: %s -> %s", state["model"], msg_model)
    state["model"] = msg_model

3. High: No timeout/cancellation (`src/llm/agent_sdk_client.py:165-175`)

sdk.query() can run for many minutes (up to max_turns iterations with slow tools). The rest of the codebase uses timeouts consistently (orchestrator, agents, MCP connections). Without asyncio.timeout(), a hung SDK query leaks both the coroutine and the child CLI process.

Fix: Add a timeout parameter to complete():

async def complete(self, *, prompt: str, ..., timeout: float | None = None) -> SdkResult:
    ...
    try:
        if timeout:
            async with asyncio.timeout(timeout):
                async for message in sdk.query(prompt=prompt, options=options):
                    ...
        else:
            async for message in sdk.query(prompt=prompt, options=options):
                ...

4. Medium: Config access pattern mismatch (`src/llm/agent_sdk_client.py:206-217`)

The _get_section() helper uses .get() for top-level Dynaconf sections, but existing code uses attribute access (cfg.aws, cfg.anthropic). This works but is inconsistent and the dict[str, Any] | Any return type defeats type checking.

Existing pattern (e.g. src/connections/aws.py:19):

cfg = get_config()
aws_cfg = cfg.aws          # attribute access for top-level
profile = aws_cfg.get("profile")  # .get() for nested

Consider aligning to match.

The overall architecture (lazy import, feature flag, frozen types, clean test fixtures with fake SDK injection) is well-done. These fixes are straightforward and don't require structural changes.

…onfig fixes Addresses review on rhpds#24: - No timeout/cancellation: a hung sdk.query() leaked the coroutine and the child CLI process. complete() now takes a `timeout` (seconds), defaulting to the new `agent.sdk.timeout` config (300s; null/0 disables). The query loop is wrapped in asyncio.timeout(), and on expiry the result is marked an error instead of hanging. asyncio.timeout(None) is a no-op, so one code path covers bounded + unbounded. - Fragile model capture: warn when AssistantMessage.model changes mid-conversation instead of silently letting "last wins". - Incorrect session_id fallback: the `or state[...]` form read as falling back to a stale id; replaced with an explicit truthy guard that takes the ResultMessage id when present without clobbering the AssistantMessage id. - Config access mismatch: _get_section now coerces to a real dict (mirrors skills loader's _get_skills_section), giving a concrete dict[str, Any] return type. Documented why attribute access (cfg.agent) isn't used — from_config must also accept the plain-dict configs the tests pass. Adds tests: per-call + config-default timeout, model-change warning, session_id preservation, and the timeout config knob. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

PalmPalm7 · 2026-06-03T16:20:01Z

Thanks for the review, Patrick! All four findings are addressed in 0f5390b.

1. Incorrect session_id fallback — replaced the or state["session_id"] form with an explicit truthy guard: _ingest_result takes ResultMessage.session_id when present, without clobbering the id already captured from AssistantMessage if it's somehow empty. (Went with the defensive variant you offered.)

2. Fragile model capture — _ingest_assistant now logs a warning when AssistantMessage.model changes mid-conversation (last still wins).

3. No timeout/cancellation — complete() takes a timeout (seconds), defaulting to a new agent.sdk.timeout config (300; null/0 disables). The query loop is wrapped in asyncio.timeout(); on expiry the result is marked an error instead of hanging and leaking the coroutine + child CLI process. asyncio.timeout(None) is a no-op, so one code path covers the bounded and unbounded cases.

4. Config access pattern — _get_section now coerces the section to a real dict (mirroring src.skills.loader._get_skills_section), which fixes the dict | Any return type. I kept .get() rather than attribute access (cfg.agent) because from_config must also accept the plain-dict configs the tests pass — added a comment explaining that so it reads as intentional.

New tests cover the timeout (per-call override + config default), the model-change warning, and session_id preservation. Full local gate is green — black / ruff / mypy on src/ plus pytest (26 adapter tests, 63 total). CI green on this PR too.

rut31337

Re-review: Approve

All four issues from commit 0f5390b properly addressed:

1. Timeout/cancellation — FIXED

asyncio.timeout(effective_timeout) wraps the query loop
Config knob: agent.sdk.timeout: 300 (default 5 min; null/0 disables)
Per-call timeout parameter overrides config default
asyncio.timeout(None) is a no-op — one code path for bounded + unbounded
On timeout: is_error=True, informative error_message, warning logged
5 tests cover per-call override, config default, and disable behavior

2. Model-change warning — FIXED

Logs "Model changed mid-conversation: %s -> %s" when model differs between AssistantMessages
Still captures final model correctly
Test validates warning is logged

3. session_id fallback — FIXED

Removed fragile or state["session_id"] pattern
Replaced with explicit truthy guard: only overwrites if ResultMessage has a real session_id
Test validates AssistantMessage session_id is preserved when ResultMessage returns None

4. Config coercion — FIXED

_get_section() now returns dict[str, Any] (concrete type, not | Any)
Coerces Dynaconf Box via .to_dict(), falls back to dict(raw) for plain dicts
Docstring explains why .get() is used over attribute access (test compatibility)

26 tests total (was 19), all passing. No new concerns. Ready to merge.

rut31337 · 2026-06-03T16:53:40Z

PRs #22, #23, and #25 are now merged. This PR has a merge conflict with the updated main — likely in config/config.yaml where #23's skills: section was added adjacent to your agent: section.

Could you rebase against main to resolve? Should be a straightforward conflict in config/config.yaml — both sections need to coexist.

git fetch upstream
git rebase upstream/main
# resolve config/config.yaml conflict (keep both skills: and agent: sections)
git push --force-with-lease

Approved and ready to merge once the conflict is resolved.

Brings the SDK-adapter branch up to date with main (now includes the merged skills loader rhpds#23, Skills UI rhpds#25, contributing docs rhpds#22, and the admin/group-based-access changes). Conflict: config/config.yaml — main added the `skills:` section (via rhpds#23) and this branch added the `agent:`/`agent.sdk` section. Resolved by keeping both; they are independent and complementary (agent.runtime selects the runtime, skills.* configures SKILL.md discovery). Local gate green on the merged tree: black, ruff, mypy (src/) + pytest (87 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

PalmPalm7 added 2 commits May 19, 2026 05:14

PalmPalm7 requested a review from rut31337 June 1, 2026 09:30

PalmPalm7 mentioned this pull request Jun 1, 2026

feat(llm): add Claude Agent SDK adapter behind agent.runtime feature flag PalmPalm7/parsec#2

Open

6 tasks

PalmPalm7 requested a review from Shreyanand June 1, 2026 09:40

PalmPalm7 marked this pull request as draft June 1, 2026 10:16

PalmPalm7 removed request for Shreyanand and rut31337 June 1, 2026 10:16

PalmPalm7 marked this pull request as ready for review June 1, 2026 15:30

PalmPalm7 requested review from Shreyanand and rut31337 June 1, 2026 15:30

rut31337 requested changes Jun 2, 2026

View reviewed changes

PalmPalm7 removed the request for review from Shreyanand June 3, 2026 15:49

PalmPalm7 requested a review from rut31337 June 3, 2026 16:20

rut31337 previously approved these changes Jun 3, 2026

View reviewed changes

PalmPalm7 dismissed rut31337’s stale review via 6835602 June 3, 2026 17:24

PalmPalm7 requested a review from rut31337 June 3, 2026 17:25

rut31337 approved these changes Jun 3, 2026

View reviewed changes

rut31337 merged commit 23055be into rhpds:main Jun 3, 2026
3 checks passed

This was referenced Jun 4, 2026

feat(skills): ship sample Parsec-native skills + bake into image #27

Open

feat(agent): AgentRunner runtime dispatcher (Agent SDK migration, phase 2) #30

Draft

feat(llm): MLflow tracing for the Agent SDK subprocess (migration, phase 2) #31

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(llm): add Claude Agent SDK adapter behind agent.runtime flag (migration, phase 1)#24

feat(llm): add Claude Agent SDK adapter behind agent.runtime flag (migration, phase 1)#24
rut31337 merged 4 commits into
rhpds:mainfrom
PalmPalm7:migration/agent-sdk-adapter

PalmPalm7 commented Jun 1, 2026 •

edited

Loading

Uh oh!

rut31337 left a comment

Uh oh!

PalmPalm7 commented Jun 3, 2026

Uh oh!

rut31337 left a comment

Uh oh!

rut31337 commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

PalmPalm7 commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this does

What changed (6 files, +815; 2 commits)

How to test — local

Result — local

How to test — staging (real OpenShift cluster, live company Vertex)

Result — staging

Notes for the reviewer

Uh oh!

rut31337 left a comment

Choose a reason for hiding this comment

Review: Request Changes

1. High: Incorrect session_id fallback (src/llm/agent_sdk_client.py:233-234)

2. High: Fragile model capture (src/llm/agent_sdk_client.py:220-226)

3. High: No timeout/cancellation (src/llm/agent_sdk_client.py:165-175)

4. Medium: Config access pattern mismatch (src/llm/agent_sdk_client.py:206-217)

Uh oh!

PalmPalm7 commented Jun 3, 2026

Uh oh!

rut31337 left a comment

Choose a reason for hiding this comment

Re-review: Approve

Uh oh!

rut31337 commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PalmPalm7 commented Jun 1, 2026 •

edited

Loading

1. High: Incorrect session_id fallback (`src/llm/agent_sdk_client.py:233-234`)

2. High: Fragile model capture (`src/llm/agent_sdk_client.py:220-226`)

3. High: No timeout/cancellation (`src/llm/agent_sdk_client.py:165-175`)

4. Medium: Config access pattern mismatch (`src/llm/agent_sdk_client.py:206-217`)