Skip to content

feat(llm): add Claude Agent SDK adapter behind agent.runtime flag (migration, phase 1)#24

Merged
rut31337 merged 4 commits into
rhpds:mainfrom
PalmPalm7:migration/agent-sdk-adapter
Jun 3, 2026
Merged

feat(llm): add Claude Agent SDK adapter behind agent.runtime flag (migration, phase 1)#24
rut31337 merged 4 commits into
rhpds:mainfrom
PalmPalm7:migration/agent-sdk-adapter

Conversation

@PalmPalm7

@PalmPalm7 PalmPalm7 commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Second of the two phase-1 migration PRs. Additive, behind a feature flag that defaults to today's behavior; independent of the skills-loader PR (#23) and mergeable in any order.


What this does

Adds src/llm/ — a thin async adapter (AgentSdkClient) around claude_agent_sdk.query(), selected by a new agent.runtime: legacy | sdk flag that defaults to legacy (today's raw-messages.create() path). claude_agent_sdk is imported lazily, so the module ships fine without the SDK installed; only .complete() triggers the import and raises a clear AgentSdkUnavailableError if it's missing.

No behavior change: nothing calls the adapter yet — the orchestrator is untouched. This is the tested runtime scaffolding the orchestrator integration (a later phase) will build on, landed separately so it can be reviewed in isolation and can't affect production while flagged off.

What changed (6 files, +815; 2 commits)

Area File Lines Purpose
Adapter src/llm/agent_sdk_client.py +293 AgentSdkClient.complete() — runs one query(), aggregates the streamed AssistantMessage/UserMessage/ResultMessage flow into SdkResult
Types src/llm/types.py +44 SdkResult, SdkUsage (frozen)
Runtime flag src/llm/runtime.py +43 get_runtime(config)legacy|sdk, unknown values coerced to legacy
API src/llm/__init__.py +39 public surface
Config config/config.yaml +14 new agent: section (runtime flag + sdk overrides)
Tests tests/test_agent_sdk_client.py +382 19 tests (fake SDK injected via sys.modules)

Commits: db7c8b7 (adapter) and af6d243 (fix — see below).

How to test — local

git fetch origin && git checkout migration/agent-sdk-adapter
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install black ruff mypy types-PyYAML

pytest tests/test_agent_sdk_client.py -v      # 19 tests; fake SDK, no network, no SDK install needed
black --check --line-length=100 src/
ruff check src/
mypy src/ --ignore-missing-imports

Optional — a real one-shot call (needs the SDK + an Anthropic key):

pip install claude-agent-sdk && export ANTHROPIC_API_KEY=sk-ant-...
python3 - <<'PY'
import asyncio
from src.llm import AgentSdkClient
from src.llm.agent_sdk_client import AgentSdkConfig
async def main():
    r = await AgentSdkClient(AgentSdkConfig(model="claude-sonnet-4-6")).complete(prompt="say hi in 3 words")
    print(r.succeeded, repr(r.text), r.model, r.usage.total_cost_usd)
asyncio.run(main())
PY

Result — local

pytest → 19 passed (click to expand)
$ pytest tests/test_agent_sdk_client.py -q
...................                                                      [100%]
19 passed in 0.15s

Covers: runtime flag (default, valid, unknown→legacy, Dynaconf-like); from_config (anthropic defaults, sdk overrides, empty config); complete() (lazy-import error path, text aggregation, tool_use↔tool_result pairing, error capture, options pass-through, omitted optionals, per-call max_turns); model/session captured from AssistantMessage; SdkResult.succeeded.

quality gates → all clean
$ ruff check src/llm tests/test_agent_sdk_client.py
All checks passed!
$ black --check src/llm tests/test_agent_sdk_client.py
5 files would be left unchanged.
$ mypy src/llm
Success: no issues found in 4 source files
bug caught + fixed during live API testing (commit af6d243)

A live call against the real Anthropic API failed an assertion: result.model came back None. Root cause — the real claude_agent_sdk ResultMessage has no model field (it carries usage/cost); the model lives on AssistantMessage. The adapter now captures model/session_id from AssistantMessage during the stream, with a regression test (test_complete_captures_model_from_assistant_even_when_result_has_none). This is exactly the kind of gap pure-mock testing misses — worth knowing the adapter has now been corrected against the real SDK shape.

How to test — staging (real OpenShift cluster, live company Vertex)

The standard Parsec image can't exercise this adapter: claude_agent_sdk shells out to the claude CLI (Node), which the image doesn't bundle, and nothing in the orchestrator calls the adapter yet. So a purpose-built image (this branch + Node + the claude CLI + claude-agent-sdk) was run as a one-off Job that calls AgentSdkClient.complete() against company Vertex AI (itpc-gcp-octo-eng-claude / global), using a GCP ADC mounted as a Secret.

# essence (full bundle available on request):
#  - build image: FROM ubi9/python-311 + dnf nodejs + npm i -g @anthropic-ai/claude-code
#                 + pip install claude-agent-sdk anthropic[vertex] + clone this branch
#  - Job env: CLAUDE_CODE_USE_VERTEX=1, ANTHROPIC_VERTEX_PROJECT_ID, CLOUD_ML_REGION=global,
#             GOOGLE_APPLICATION_CREDENTIALS=<mounted ADC>
#  - Job runs:
python3 -c "
import asyncio; from src.llm import AgentSdkClient
from src.llm.agent_sdk_client import AgentSdkConfig
print(asyncio.run(AgentSdkClient(AgentSdkConfig(model='claude-sonnet-4-5@20250929'))
      .complete(prompt='Reply with exactly: SDK ADAPTER OK. No tools.')))"

Result — staging

Job output → PASS, real Vertex answer (click to expand)
=== PR#2 adapter runtime test ===
model         : claude-sonnet-4-5@20250929
CLAUDE_CODE_USE_VERTEX=1
ANTHROPIC_VERTEX_PROJECT_ID=itpc-gcp-octo-eng-claude
CLOUD_ML_REGION=global
---
succeeded     : True
text          : 'SDK ADAPTER OK'
model         : claude-sonnet-4-5-20250929
session_id    : 142ccb80-95a9-4d28-a242-7aa44387a2fc
tool calls    : 0
usage         : in=10 out=68 cost=$0.093987 turns=1
---
RESULT: PASS

The SDK spawned the claude CLI, authenticated to Vertex, and the adapter correctly aggregated the stream into SdkResult. The returned claude-sonnet-4-5-20250929 (SDK normalizes the @date form) also re-confirms the af6d243 model-capture fix works through the real SDK path.

Cost data point for the migration discussion: that trivial call cost ~$0.094 — the SDK's system-prompt/tool-definition overhead vs. a bare messages.create(). Concrete input for the Phase-2 cost/parity benchmark (Risk #4 in the plan).

Notes for the reviewer

  • Default agent.runtime: legacy means this is dead, safe code until a future PR adds the orchestrator dispatcher — intentional, so it lands low-risk.
  • claude-agent-sdk is intentionally not in requirements.txt (lazy import) so the image/CI don't pull Node + the CLI yet; that arrives with the integration PR as an optional extra.
  • Same CI footnote as feat(skills): add SKILL.md loader + GET /api/skills (Agent SDK migration, phase 1) #23: pytest isn't in ci.yml today, so these 19 tests won't run at merge unless we add a step — glad to.

🤖 Generated with Claude Code

PalmPalm7 added 2 commits May 19, 2026 05:14
…flag

Introduces src.llm package with:
- AgentSdkClient: thin async wrapper around claude_agent_sdk.query()
  that aggregates the streamed AssistantMessage/UserMessage/ResultMessage
  flow into a single SdkResult (text + tool invocations + usage)
- get_runtime() helper reading the new agent.runtime config flag
  (legacy | sdk), defaulting to legacy so existing deploys are unchanged
- Lazy import of claude_agent_sdk: the adapter file is importable
  without the package installed; calls raise AgentSdkUnavailableError
  with install hint instead

Tests inject a fake sdk module via sys.modules so coverage exercises
the message-stream aggregation, options pass-through, error capture,
and config defaults without requiring the real SDK in CI.

No orchestrator changes — additive only. Phase 2 of the migration
will wire AgentSdkClient into a new AgentRunner dispatcher that reads
agent.runtime. See artifacts/parsec-agent-sdk-migration-plan.md.
Real-SDK regression caught during live smoke test against
api.anthropic.com on 2026-05-21. The adapter was reading
ResultMessage.model, but the actual claude_agent_sdk ResultMessage
(>=0.2.x) has no model field at all — model lives on AssistantMessage.
Result: SdkResult.model came back as None for every real call.

Fix:
- Capture model + session_id from AssistantMessage during the stream
  (last value wins, in case the conversation switches models).
- ResultMessage still provides session_id as fallback + the
  authoritative usage/cost/num_turns.
- Refactor the message loop into _ingest_assistant / _ingest_user /
  _ingest_result helpers to keep complete() under the mccabe complexity
  limit after the added branches.

Added regression test test_complete_captures_model_from_assistant_even_when_result_has_none
to lock in the new contract.

19 tests passing (was 18). All quality gates (ruff/black/mypy) clean.
@PalmPalm7 PalmPalm7 requested a review from rut31337 June 1, 2026 09:30
@PalmPalm7 PalmPalm7 requested a review from Shreyanand June 1, 2026 09:40
@PalmPalm7 PalmPalm7 marked this pull request as draft June 1, 2026 10:16
@PalmPalm7 PalmPalm7 marked this pull request as ready for review June 1, 2026 15:30
@PalmPalm7 PalmPalm7 requested review from Shreyanand and rut31337 June 1, 2026 15:30

@rut31337 rut31337 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Request Changes

The adapter design is solid — lazy import, feature flag defaulting to legacy, frozen result types, and 19 tests. Four issues to address:


1. High: Incorrect session_id fallback (src/llm/agent_sdk_client.py:233-234)

state["session_id"] = getattr(message, "session_id", None) or state["session_id"]

The or state["session_id"] fallback is semantically incorrect — per the SDK spec, ResultMessage.session_id is a required str field, not optional. This silently keeps a stale session_id if ResultMessage.session_id were somehow empty. Drop the fallback:

state["session_id"] = getattr(message, "session_id", None)

Or if defensive:

result_session = getattr(message, "session_id", None)
if result_session:
    state["session_id"] = result_session

2. High: Fragile model capture (src/llm/agent_sdk_client.py:220-226)

"Last assistant message wins" with no validation. If a future SDK version streams multiple AssistantMessages with different model values, the code silently overwrites. Should log a warning on inconsistency:

if msg_model:
    if state["model"] and state["model"] != msg_model:
        logger.warning("Model changed mid-conversation: %s -> %s", state["model"], msg_model)
    state["model"] = msg_model

3. High: No timeout/cancellation (src/llm/agent_sdk_client.py:165-175)

sdk.query() can run for many minutes (up to max_turns iterations with slow tools). The rest of the codebase uses timeouts consistently (orchestrator, agents, MCP connections). Without asyncio.timeout(), a hung SDK query leaks both the coroutine and the child CLI process.

Fix: Add a timeout parameter to complete():

async def complete(self, *, prompt: str, ..., timeout: float | None = None) -> SdkResult:
    ...
    try:
        if timeout:
            async with asyncio.timeout(timeout):
                async for message in sdk.query(prompt=prompt, options=options):
                    ...
        else:
            async for message in sdk.query(prompt=prompt, options=options):
                ...

4. Medium: Config access pattern mismatch (src/llm/agent_sdk_client.py:206-217)

The _get_section() helper uses .get() for top-level Dynaconf sections, but existing code uses attribute access (cfg.aws, cfg.anthropic). This works but is inconsistent and the dict[str, Any] | Any return type defeats type checking.

Existing pattern (e.g. src/connections/aws.py:19):

cfg = get_config()
aws_cfg = cfg.aws          # attribute access for top-level
profile = aws_cfg.get("profile")  # .get() for nested

Consider aligning to match.


The overall architecture (lazy import, feature flag, frozen types, clean test fixtures with fake SDK injection) is well-done. These fixes are straightforward and don't require structural changes.

@PalmPalm7 PalmPalm7 removed the request for review from Shreyanand June 3, 2026 15:49
…onfig fixes

Addresses review on rhpds#24:

- No timeout/cancellation: a hung sdk.query() leaked the coroutine and the
  child CLI process. complete() now takes a `timeout` (seconds), defaulting to
  the new `agent.sdk.timeout` config (300s; null/0 disables). The query loop is
  wrapped in asyncio.timeout(), and on expiry the result is marked an error
  instead of hanging. asyncio.timeout(None) is a no-op, so one code path covers
  bounded + unbounded.
- Fragile model capture: warn when AssistantMessage.model changes
  mid-conversation instead of silently letting "last wins".
- Incorrect session_id fallback: the `or state[...]` form read as falling back
  to a stale id; replaced with an explicit truthy guard that takes the
  ResultMessage id when present without clobbering the AssistantMessage id.
- Config access mismatch: _get_section now coerces to a real dict (mirrors
  skills loader's _get_skills_section), giving a concrete dict[str, Any] return
  type. Documented why attribute access (cfg.agent) isn't used — from_config
  must also accept the plain-dict configs the tests pass.

Adds tests: per-call + config-default timeout, model-change warning,
session_id preservation, and the timeout config knob.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@PalmPalm7

Copy link
Copy Markdown
Contributor Author

Thanks for the review, Patrick! All four findings are addressed in 0f5390b.

1. Incorrect session_id fallback — replaced the or state["session_id"] form with an explicit truthy guard: _ingest_result takes ResultMessage.session_id when present, without clobbering the id already captured from AssistantMessage if it's somehow empty. (Went with the defensive variant you offered.)

2. Fragile model capture_ingest_assistant now logs a warning when AssistantMessage.model changes mid-conversation (last still wins).

3. No timeout/cancellationcomplete() takes a timeout (seconds), defaulting to a new agent.sdk.timeout config (300; null/0 disables). The query loop is wrapped in asyncio.timeout(); on expiry the result is marked an error instead of hanging and leaking the coroutine + child CLI process. asyncio.timeout(None) is a no-op, so one code path covers the bounded and unbounded cases.

4. Config access pattern_get_section now coerces the section to a real dict (mirroring src.skills.loader._get_skills_section), which fixes the dict | Any return type. I kept .get() rather than attribute access (cfg.agent) because from_config must also accept the plain-dict configs the tests pass — added a comment explaining that so it reads as intentional.

New tests cover the timeout (per-call override + config default), the model-change warning, and session_id preservation. Full local gate is green — black / ruff / mypy on src/ plus pytest (26 adapter tests, 63 total). CI green on this PR too.

@PalmPalm7 PalmPalm7 requested a review from rut31337 June 3, 2026 16:20
rut31337
rut31337 previously approved these changes Jun 3, 2026

@rut31337 rut31337 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Approve

All four issues from commit 0f5390b properly addressed:

1. Timeout/cancellation — FIXED

  • asyncio.timeout(effective_timeout) wraps the query loop
  • Config knob: agent.sdk.timeout: 300 (default 5 min; null/0 disables)
  • Per-call timeout parameter overrides config default
  • asyncio.timeout(None) is a no-op — one code path for bounded + unbounded
  • On timeout: is_error=True, informative error_message, warning logged
  • 5 tests cover per-call override, config default, and disable behavior

2. Model-change warning — FIXED

  • Logs "Model changed mid-conversation: %s -> %s" when model differs between AssistantMessages
  • Still captures final model correctly
  • Test validates warning is logged

3. session_id fallback — FIXED

  • Removed fragile or state["session_id"] pattern
  • Replaced with explicit truthy guard: only overwrites if ResultMessage has a real session_id
  • Test validates AssistantMessage session_id is preserved when ResultMessage returns None

4. Config coercion — FIXED

  • _get_section() now returns dict[str, Any] (concrete type, not | Any)
  • Coerces Dynaconf Box via .to_dict(), falls back to dict(raw) for plain dicts
  • Docstring explains why .get() is used over attribute access (test compatibility)

26 tests total (was 19), all passing. No new concerns. Ready to merge.

@rut31337

rut31337 commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

PRs #22, #23, and #25 are now merged. This PR has a merge conflict with the updated main — likely in config/config.yaml where #23's skills: section was added adjacent to your agent: section.

Could you rebase against main to resolve? Should be a straightforward conflict in config/config.yaml — both sections need to coexist.

git fetch upstream
git rebase upstream/main
# resolve config/config.yaml conflict (keep both skills: and agent: sections)
git push --force-with-lease

Approved and ready to merge once the conflict is resolved.

Brings the SDK-adapter branch up to date with main (now includes the
merged skills loader rhpds#23, Skills UI rhpds#25, contributing docs rhpds#22, and the
admin/group-based-access changes).

Conflict: config/config.yaml — main added the `skills:` section (via rhpds#23)
and this branch added the `agent:`/`agent.sdk` section. Resolved by
keeping both; they are independent and complementary (agent.runtime
selects the runtime, skills.* configures SKILL.md discovery).

Local gate green on the merged tree: black, ruff, mypy (src/) + pytest
(87 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants