feat(llm): add Claude Agent SDK adapter behind agent.runtime flag (migration, phase 1)#24
Conversation
…flag Introduces src.llm package with: - AgentSdkClient: thin async wrapper around claude_agent_sdk.query() that aggregates the streamed AssistantMessage/UserMessage/ResultMessage flow into a single SdkResult (text + tool invocations + usage) - get_runtime() helper reading the new agent.runtime config flag (legacy | sdk), defaulting to legacy so existing deploys are unchanged - Lazy import of claude_agent_sdk: the adapter file is importable without the package installed; calls raise AgentSdkUnavailableError with install hint instead Tests inject a fake sdk module via sys.modules so coverage exercises the message-stream aggregation, options pass-through, error capture, and config defaults without requiring the real SDK in CI. No orchestrator changes — additive only. Phase 2 of the migration will wire AgentSdkClient into a new AgentRunner dispatcher that reads agent.runtime. See artifacts/parsec-agent-sdk-migration-plan.md.
Real-SDK regression caught during live smoke test against api.anthropic.com on 2026-05-21. The adapter was reading ResultMessage.model, but the actual claude_agent_sdk ResultMessage (>=0.2.x) has no model field at all — model lives on AssistantMessage. Result: SdkResult.model came back as None for every real call. Fix: - Capture model + session_id from AssistantMessage during the stream (last value wins, in case the conversation switches models). - ResultMessage still provides session_id as fallback + the authoritative usage/cost/num_turns. - Refactor the message loop into _ingest_assistant / _ingest_user / _ingest_result helpers to keep complete() under the mccabe complexity limit after the added branches. Added regression test test_complete_captures_model_from_assistant_even_when_result_has_none to lock in the new contract. 19 tests passing (was 18). All quality gates (ruff/black/mypy) clean.
rut31337
left a comment
There was a problem hiding this comment.
Review: Request Changes
The adapter design is solid — lazy import, feature flag defaulting to legacy, frozen result types, and 19 tests. Four issues to address:
1. High: Incorrect session_id fallback (src/llm/agent_sdk_client.py:233-234)
state["session_id"] = getattr(message, "session_id", None) or state["session_id"]The or state["session_id"] fallback is semantically incorrect — per the SDK spec, ResultMessage.session_id is a required str field, not optional. This silently keeps a stale session_id if ResultMessage.session_id were somehow empty. Drop the fallback:
state["session_id"] = getattr(message, "session_id", None)Or if defensive:
result_session = getattr(message, "session_id", None)
if result_session:
state["session_id"] = result_session2. High: Fragile model capture (src/llm/agent_sdk_client.py:220-226)
"Last assistant message wins" with no validation. If a future SDK version streams multiple AssistantMessages with different model values, the code silently overwrites. Should log a warning on inconsistency:
if msg_model:
if state["model"] and state["model"] != msg_model:
logger.warning("Model changed mid-conversation: %s -> %s", state["model"], msg_model)
state["model"] = msg_model3. High: No timeout/cancellation (src/llm/agent_sdk_client.py:165-175)
sdk.query() can run for many minutes (up to max_turns iterations with slow tools). The rest of the codebase uses timeouts consistently (orchestrator, agents, MCP connections). Without asyncio.timeout(), a hung SDK query leaks both the coroutine and the child CLI process.
Fix: Add a timeout parameter to complete():
async def complete(self, *, prompt: str, ..., timeout: float | None = None) -> SdkResult:
...
try:
if timeout:
async with asyncio.timeout(timeout):
async for message in sdk.query(prompt=prompt, options=options):
...
else:
async for message in sdk.query(prompt=prompt, options=options):
...4. Medium: Config access pattern mismatch (src/llm/agent_sdk_client.py:206-217)
The _get_section() helper uses .get() for top-level Dynaconf sections, but existing code uses attribute access (cfg.aws, cfg.anthropic). This works but is inconsistent and the dict[str, Any] | Any return type defeats type checking.
Existing pattern (e.g. src/connections/aws.py:19):
cfg = get_config()
aws_cfg = cfg.aws # attribute access for top-level
profile = aws_cfg.get("profile") # .get() for nestedConsider aligning to match.
The overall architecture (lazy import, feature flag, frozen types, clean test fixtures with fake SDK injection) is well-done. These fixes are straightforward and don't require structural changes.
…onfig fixes Addresses review on rhpds#24: - No timeout/cancellation: a hung sdk.query() leaked the coroutine and the child CLI process. complete() now takes a `timeout` (seconds), defaulting to the new `agent.sdk.timeout` config (300s; null/0 disables). The query loop is wrapped in asyncio.timeout(), and on expiry the result is marked an error instead of hanging. asyncio.timeout(None) is a no-op, so one code path covers bounded + unbounded. - Fragile model capture: warn when AssistantMessage.model changes mid-conversation instead of silently letting "last wins". - Incorrect session_id fallback: the `or state[...]` form read as falling back to a stale id; replaced with an explicit truthy guard that takes the ResultMessage id when present without clobbering the AssistantMessage id. - Config access mismatch: _get_section now coerces to a real dict (mirrors skills loader's _get_skills_section), giving a concrete dict[str, Any] return type. Documented why attribute access (cfg.agent) isn't used — from_config must also accept the plain-dict configs the tests pass. Adds tests: per-call + config-default timeout, model-change warning, session_id preservation, and the timeout config knob. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thanks for the review, Patrick! All four findings are addressed in 1. Incorrect session_id fallback — replaced the 2. Fragile model capture — 3. No timeout/cancellation — 4. Config access pattern — New tests cover the timeout (per-call override + config default), the model-change warning, and session_id preservation. Full local gate is green — black / ruff / mypy on |
rut31337
left a comment
There was a problem hiding this comment.
Re-review: Approve
All four issues from commit 0f5390b properly addressed:
1. Timeout/cancellation — FIXED
asyncio.timeout(effective_timeout)wraps the query loop- Config knob:
agent.sdk.timeout: 300(default 5 min;null/0disables) - Per-call
timeoutparameter overrides config default asyncio.timeout(None)is a no-op — one code path for bounded + unbounded- On timeout:
is_error=True, informativeerror_message, warning logged - 5 tests cover per-call override, config default, and disable behavior
2. Model-change warning — FIXED
- Logs
"Model changed mid-conversation: %s -> %s"when model differs between AssistantMessages - Still captures final model correctly
- Test validates warning is logged
3. session_id fallback — FIXED
- Removed fragile
or state["session_id"]pattern - Replaced with explicit truthy guard: only overwrites if ResultMessage has a real session_id
- Test validates AssistantMessage session_id is preserved when ResultMessage returns
None
4. Config coercion — FIXED
_get_section()now returnsdict[str, Any](concrete type, not| Any)- Coerces Dynaconf Box via
.to_dict(), falls back todict(raw)for plain dicts - Docstring explains why
.get()is used over attribute access (test compatibility)
26 tests total (was 19), all passing. No new concerns. Ready to merge.
|
PRs #22, #23, and #25 are now merged. This PR has a merge conflict with the updated Could you rebase against git fetch upstream
git rebase upstream/main
# resolve config/config.yaml conflict (keep both skills: and agent: sections)
git push --force-with-leaseApproved and ready to merge once the conflict is resolved. |
Brings the SDK-adapter branch up to date with main (now includes the merged skills loader rhpds#23, Skills UI rhpds#25, contributing docs rhpds#22, and the admin/group-based-access changes). Conflict: config/config.yaml — main added the `skills:` section (via rhpds#23) and this branch added the `agent:`/`agent.sdk` section. Resolved by keeping both; they are independent and complementary (agent.runtime selects the runtime, skills.* configures SKILL.md discovery). Local gate green on the merged tree: black, ruff, mypy (src/) + pytest (87 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Second of the two phase-1 migration PRs. Additive, behind a feature flag that defaults to today's behavior; independent of the skills-loader PR (#23) and mergeable in any order.
What this does
Adds
src/llm/— a thin async adapter (AgentSdkClient) aroundclaude_agent_sdk.query(), selected by a newagent.runtime: legacy | sdkflag that defaults tolegacy(today's raw-messages.create()path).claude_agent_sdkis imported lazily, so the module ships fine without the SDK installed; only.complete()triggers the import and raises a clearAgentSdkUnavailableErrorif it's missing.No behavior change: nothing calls the adapter yet — the orchestrator is untouched. This is the tested runtime scaffolding the orchestrator integration (a later phase) will build on, landed separately so it can be reviewed in isolation and can't affect production while flagged off.
What changed (6 files, +815; 2 commits)
src/llm/agent_sdk_client.pyAgentSdkClient.complete()— runs onequery(), aggregates the streamedAssistantMessage/UserMessage/ResultMessageflow intoSdkResultsrc/llm/types.pySdkResult,SdkUsage(frozen)src/llm/runtime.pyget_runtime(config)→legacy|sdk, unknown values coerced tolegacysrc/llm/__init__.pyconfig/config.yamlagent:section (runtime flag + sdk overrides)tests/test_agent_sdk_client.pysys.modules)Commits:
db7c8b7(adapter) andaf6d243(fix — see below).How to test — local
Optional — a real one-shot call (needs the SDK + an Anthropic key):
Result — local
pytest→ 19 passed (click to expand)Covers: runtime flag (default, valid, unknown→legacy, Dynaconf-like);
from_config(anthropic defaults, sdk overrides, empty config);complete()(lazy-import error path, text aggregation, tool_use↔tool_result pairing, error capture, options pass-through, omitted optionals, per-call max_turns); model/session captured fromAssistantMessage;SdkResult.succeeded.quality gates → all clean
bug caught + fixed during live API testing (commit af6d243)
A live call against the real Anthropic API failed an assertion:
result.modelcame backNone. Root cause — the realclaude_agent_sdkResultMessagehas nomodelfield (it carries usage/cost); the model lives onAssistantMessage. The adapter now capturesmodel/session_idfromAssistantMessageduring the stream, with a regression test (test_complete_captures_model_from_assistant_even_when_result_has_none). This is exactly the kind of gap pure-mock testing misses — worth knowing the adapter has now been corrected against the real SDK shape.How to test — staging (real OpenShift cluster, live company Vertex)
The standard Parsec image can't exercise this adapter:
claude_agent_sdkshells out to theclaudeCLI (Node), which the image doesn't bundle, and nothing in the orchestrator calls the adapter yet. So a purpose-built image (this branch + Node + theclaudeCLI +claude-agent-sdk) was run as a one-offJobthat callsAgentSdkClient.complete()against company Vertex AI (itpc-gcp-octo-eng-claude/global), using a GCP ADC mounted as a Secret.Result — staging
Job output → PASS, real Vertex answer (click to expand)
The SDK spawned the
claudeCLI, authenticated to Vertex, and the adapter correctly aggregated the stream intoSdkResult. The returnedclaude-sonnet-4-5-20250929(SDK normalizes the@dateform) also re-confirms the af6d243 model-capture fix works through the real SDK path.Cost data point for the migration discussion: that trivial call cost ~$0.094 — the SDK's system-prompt/tool-definition overhead vs. a bare
messages.create(). Concrete input for the Phase-2 cost/parity benchmark (Risk #4 in the plan).Notes for the reviewer
agent.runtime: legacymeans this is dead, safe code until a future PR adds the orchestrator dispatcher — intentional, so it lands low-risk.claude-agent-sdkis intentionally not inrequirements.txt(lazy import) so the image/CI don't pull Node + the CLI yet; that arrives with the integration PR as an optional extra.pytestisn't inci.ymltoday, so these 19 tests won't run at merge unless we add a step — glad to.🤖 Generated with Claude Code